Description

I. BACKGROUND OF THE INVENTION 

This application corresponds to U.S. Patent 5,524,205.

The present invention relates generally to the field 
of recovery from crashes in shared disk systems, and 
in particular, to the use of logs in such recovery. 

All computer systems may lose data if the computer 
crashes. Some systems, like data base systems, are 
particularly susceptible to possible loss of data from sys- 
tem failure or crash because those systems transfer 
great amounts of data back and forth between disks and 
processor memory. 

The common reason for data loss is incomplete 
transfer of data from a volatile storage system (e.g., 
processor memory) to a persistent storage system (e. 
g., disk). Often the incomplete transfer occurs because 
a transaction is taking place when a crash occurs. A 
transaction generally includes the transfer of a series of 
records (or changes) between the two storage systems. 

A concept that is important in addressing data loss 
and recovery from that loss is the idea of "committing" 
a transaction. A transaction is "committed" when there 
is some guarantee that all the effects of the transaction 
are stable in the persistent storage. If a crash occurs 
before a transaction commits, the steps necessary for 
recovery are different from those necessary for recovery 
if a crash occurs after a transaction commits. Recovery 
is the process of making corrections to a data base 
which will allow the complete system to restart at a 
known and desired point. 

The type of recovery needed depends, of course, 
on the reason for the loss of data. If a computer system 
crashes, the recovery needs to enable the restoration 
of the persistent storage, e.g. disks, of the computer sys- 
tem to a state consistent with that produced by the last 
committed transactions. If the persistent storage crash- 
es (called a media failure), the recovery needs to recre- 
ate the data stored onto the disk. 

Many approaches for recovering data base systems 
involve the use of logs. Logs are merely lists of time- 
ordered actions which indicate, at least in the case of 
data base systems, what changes were made to the da- 
ta base and in what order those changes were made. 
The logs thus allow a computer system to place the data 
base in a known and desired state which can then be 
used to redo or undo changes. 

Logs are difficult to manage, however, in system 
configurations where a number of computer systems, 
called "nodes," access a collection of shared disks. This 
type of configuration is called a "cluster" or a "shared 
disk" system. A system that allows any nodes in such a 
system to access any of the data is called a "data shar- 
ing" system. 

A data sharing system performs "data shipping" by
which the data blocks themselves are sent from the disk
to the requesting computer. In contrast, a function shipping
system, which is better known as a "partitioned"
system, ships a collection of operations to the computer
designated as the "server" for a partition of the data. The
server then performs the operations and ships the results
back to the requestor.

In partitioned systems, as in single node or centralized
systems, each portion of data can reside in the local
memory of at most one node. Further, both partitioned
systems and centralized systems need only record actions
on a single log. Just as importantly, data recovery
can proceed based solely on the contents of one log.

Distributed data shipping systems, on the other
hand, are decentralized so the same data can reside in
the local memories of multiple nodes and be updated
from these nodes. This results in multiple nodes logging
actions for the same data.

To avoid the problem of multiple logs containing actions
for the same data, a data sharing system may require
that the log records for the data be shipped back
to a single log that is responsible for recording recovery
information for the data. Such "remote" logging requires
extra system resources, however, because extra messages
containing the log records are needed in addition
to the I/O writes for the log. Furthermore, the delay involved
in waiting for an acknowledgment from the logging
computer can be substantial. Not only will this increase
response time, it may reduce the ability to allow
several users to have concurrent access to the same
data base.

Another alternative is to synchronize the use of a 
common log by taking turns writing to that log. This too 
is expensive, as it involves extra messages for the co- 
ordination. 

These difficulties are important to address because
data sharing systems are often preferable to partitioned
systems. For example, data sharing systems are important
for workstations and engineering design applications
because data sharing systems allow the workstations
to cache data for extended periods which permits
high performance local processing of the data. Furthermore,
data sharing systems are inherently fault-tolerant
and load balancing because a multiplicity of nodes can
access the data simultaneously, manage some local data
themselves, and share other data with other host
computers and workstations.

IBM Research report RJ 6649, January 1989, pp.
1-45, generally discusses recovery methods, and suggests,
in certain circumstances, the possibility of maintaining
redo and undo records separately.

It is therefore an object of this invention to ease redo 
log management by removing undo information from re- 
do records. 

Another object of this invention is to provide easier
management of undo information by discarding undo information
at transaction commit.

Another object of this invention is to minimize the
information which must be stored to undo transactions
in case of crashes or failures.

EP 0 465 018 B1

II. SUMMARY OF THE INVENTION 

The present invention avoids the problem of the pri- 
or art by ensuring that sufficient information from redo 
and undo buffers is maintained so that all changes of 
uncommitted transactions can be removed, the chang- 
es from the committed transactions can be recreated, 
and the storage of the undo buffers into undo logs can 
be minimized. Further efficiencies may be maintained 
by keeping a count of actions in a transaction as the ac- 
tions are undone. 

The present invention provides a data processing 
recovery apparatus and method according to claims 1 
and 5 respectively. 

The accompanying drawings, which are incorporat- 
ed in and which constitute a part of this specification, 
illustrate preferred implementations of this invention 
and, together with the accompanying textual descrip- 
tion, explain the principles of the invention. 

III. BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 is a diagram of a computer system for implementing this invention;
Figure 2 is a diagram of a portion of a disk showing blocks and pages;
Figure 3 is a diagram of a redo log;
Figure 4 is a diagram of an undo log;
Figure 5 is a diagram of an archive log;
Figure 6 is a flow diagram for performing a redo operation;
Figure 7 is a flow diagram for performing crash recovery;
Figure 8 is a flow diagram for merging archive logs;
Figure 9 is a diagram of a Dirty Blocks table;
Figure 10 is a flow diagram for implementing a write-ahead protocol to optimize undo log usage;
Figure 11 is a diagram of a Compensation Log Record;
Figure 12 is a diagram of an Active Transactions table;
Figure 13 is a flow diagram for a Transaction Start operation;
Figure 14 is a flow diagram for a Block Update operation;
Figure 15 is a flow diagram for a Block Write operation;
Figure 16 is a flow diagram for a Transaction Abort operation;
Figure 17 is a flow diagram for a Transaction Prepare operation; and
Figure 18 is a flow diagram for a Transaction Commit operation.



IV. DESCRIPTION OF THE PREFERRED 
IMPLEMENTATIONS 

Reference will now be made in detail to preferred
implementations of this invention, examples of which
are illustrated in the accompanying drawings.

A. System components 

System 100 is an example of a storage system
which can be used to implement the present invention.
System 100 includes several nodes 110, 120, and 130,
all accessing a shared disk system 140. Each of the
nodes 110, 120, and 130 includes a processor 113, 123,
and 133, respectively, to execute the storage and recovery
routines described below. Nodes 110, 120, and 130
also each include a memory 118, 128, and 138, respectively,
to provide at least two functions. One of the functions
is to act as a local memory for the corresponding
processor, and the other function is to hold the data being
exchanged with disk system 140. The portions of
memory that are used for data exchange are called
caches. Caches are generally volatile system storage.

Shared disk system 140 is also called "persistent
storage." Persistent storage refers to non-volatile system
storage whose contents are presumed to persist
when part or all of the system crashes. Traditionally, this
storage includes magnetic disk systems, but persistent
storage could also include optical disk or magnetic tape
systems as well.

In addition, the persistent storage used to implement
this invention is not limited to the architecture
shown in Figure 1. For example, the persistent storage
could include several disks each coupled to a different
node, with the nodes connected in some type of network.

Another part of persistent storage is a backup tape
system 150 which is referred to as "archive storage."
Archive storage is a term used generally to refer to the
system storage used for information that permits reconstruction
of the contents of persistent storage should the
data in the persistent storage become unreadable. For
example, should shared disk system 140 have a media
failure, tape system 150 could be used to restore disk
system 140. Archive storage frequently includes a magnetic
tape system, but it could also include magnetic or
optical disk systems as well.

Data in system 100 is usually stored in blocks,
which are the recoverable objects of the system. In general,
blocks can be operated upon only when they are
in the cache of some node.

Figure 2 shows an example of several blocks 210,
220, and 230 on a portion of a disk 200. Generally, a
block contains an integral number of pages of persistent
storage. For example, in Figure 2, block 210 includes
pages 212, 214, 216, and 218.

B. Logs

As explained above, most data base systems use
logs for recovery purposes. The logs are generally
stored in persistent storage. When a node is updating
persistent storage, the node stores the log records describing
the updates in a buffer in the node's cache.

The preferred implementation of the present inven- 
tion envisions three types of logs in persistent storage, 
but only two types of buffers in each node's cache. The 
logs are redo logs, or RLOGs, undo logs, or ULOGs, 
and archive logs, or ALOGs. The buffers are the redo 
buffers and the undo buffers. 

An example of an RLOG is shown in Figure 3, an
example of a ULOG is shown in Figure 4, and an example
of an ALOG is shown in Figure 5. The organization
of a redo buffer is similar to the RLOG, and the organization
of an undo buffer is similar to the ULOG.

A log sequence number, LSN, is the address or rel- 
ative position of a record in a log. Each log maps LSNs 
to the records in that log. 

1. RLOG 

As shown in Figure 3, RLOG 300 is a preferred im- 
plementation of a sequential file used to record informa- 
tion about changes that will permit the specific opera- 
tions which took place during those changes to be re- 
peated. Generally, those operations will need to be re- 
peated during a recovery scheme once a block has been 
restored to the state at which logged actions were per- 
formed. 

As Figure 3 shows, RLOG 300 contains several 
records 301, 302, and 310, which each contain several 
attributes. TYPE attribute 320 identifies the type of the 
corresponding RLOG record. Examples of the different 
types of RLOG records are redo records, compensation 
log records, and commit-related records. These records 
are described below. 

TID attribute 325 is a unique identifier for the trans- 
action associated with the current record. This attribute 
is used to help find the record in the ULOG correspond- 
ing to the present RLOG record. 

BSI attribute 330 is a "before state identifier." This 
identifier is described in greater detail below. Briefly, the 
BSI indicates the value of a state identifier for the version 
of the block prior to its modification by the corresponding 
transaction. 

BID attribute 335 identifies the block modified by the 
update corresponding to the RLOG record. 

REDO_DATA attribute 340 describes the nature of 
the corresponding action and provides enough informa- 
tion for the action to be redone. The term "update" is 
used broadly and interchangeably in this description 
with the term "action." Actions, in a strict sense, include 
not only record updates, but record inserts and deletes 
as well as block allocates and frees. 

LSN attribute 345 uniquely identifies the current
record on RLOG 300. As will be explained in detail below,
LSN attribute 345 is used in the preferred implementation
to control the redo scan and checkpointing of
the RLOG. LSN 345 is not stored in either RLOG records
or in blocks in the preferred implementation. Instead, it
is inherent from the position of the record in the RLOG.
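
By way of illustration only, the RLOG record layout described above can be sketched as a small data structure. The class names `RedoRecord` and `RedoLog`, the Python types chosen for each attribute, and the use of a list index as the LSN are assumptions made for this sketch; the description specifies only the attributes themselves and that the LSN follows from a record's position in the log.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RedoRecord:
    """One RLOG record; field names mirror the attributes of Figure 3."""
    type: str         # TYPE 320: e.g. "redo", "compensation", "commit"
    tid: int          # TID 325: transaction identifier
    bsi: int          # BSI 330: state identifier of the block before the action
    bid: int          # BID 335: identifier of the modified block
    redo_data: bytes  # REDO_DATA 340: enough information to repeat the action


class RedoLog:
    """Sequential log in which the LSN is implied by record position."""

    def __init__(self) -> None:
        self._records: List[RedoRecord] = []

    def append(self, record: RedoRecord) -> int:
        """Append a record and return its LSN (here, its list index)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, lsn: int) -> RedoRecord:
        """Map an LSN back to the record stored at that position."""
        return self._records[lsn]
```

Because the LSN is derived from position, it does not appear as a field of `RedoRecord`, which matches the statement that LSN 345 is not stored in the records themselves.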

One goal of this invention is to allow each node to 
manage its recovery as independently of the other 
nodes as possible. To do this, a separate RLOG is as- 
sociated with each node. The association of an RLOG 
with a node in the preferred implementation involves use 
of a different RLOG for each node. Alternatively, the 
nodes can share RLOGs or each node can have multi- 
ple RLOGs. If an RLOG is private to a node, however, 
no synchronization involving messages is needed to co- 
ordinate the use of the RLOG with other RLOGs and 
nodes. 

2. ULOG 

In Figure 4, ULOG 400 is a preferred implementa- 
tion of a sequential file used to record information per- 
mitting operations on blocks to be undone correctly. 
ULOG 400 is used to restore blocks to conditions exist- 
ing when a transaction began. 

Unlike RLOGs, each ULOG and undo buffer is as- 
sociated with a different transaction. Thus ULOGs and 
their corresponding buffers disappear as transactions 
commit, and new ULOGs appear as new transactions 
begin. Other possibilities exist. 

ULOG 400 includes several records 401, 402, and
410, which each contain two fields. A BID field 420 identifies
the block modified by the transaction logged with
this record. An UNDO_DATA field 430 describes the nature
of the update and provides enough information for
the update to be undone.

RLSN field 440 identifies the RLOG record which 
describes the same action for which this action is the 
undo. This attribute provides the ability to identify each 
ULOG uniquely. 

3. ALOG 

In Figure 5, ALOG 500 is a preferred implementa- 
tion of a sequential file used to store redo log records 
for sufficient duration to provide media recovery, such 
as when the shared disk system 140 in Figure 1 fails. 
The RLOG buffers are the source of information from 
which ALOG 500 is generated, and thus ALOG 500 has 
the same attributes as RLOG 300. 

ALOGs are preferably formed from the truncated 
portions of corresponding RLOGs. The truncated por- 
tions are portions which are no longer needed to bring 
the persistent storage versions of blocks up to current 
versions. The records in the truncated portions of the 
RLOGs are still needed, however, should the persistent 
storage version of a block become unavailable and need 
to be recovered from the version of the block on archive
storage.

Similar to RLOG 300, ALOG 500 includes several
records 501, 502, and 503. Attributes TYPE 520, TID
525, BSI 530, BID 535, and REDO_DATA 540 have the
same functions as the attributes in RLOG 300 of the
same name. LSN 545, like LSN 345 for RLOG 300, identifies
the ALOG record.

C. State Identifiers (and Write-Ahead Log Protocol)

In log-based systems, a log record is applied to a 
block only when the recorded state of the block is ap- 
propriate for the update designated by the log record. 
Thus, a sufficient condition for correct redo is to apply a 
logged transaction to a block when the block is in the 
same state as it was when the original action was per- 
formed. If the original action was correct, the redone ac- 
tion will also be correct. 

It is unwieldy and impractical to store the entire contents
of a block state on a log. Therefore, a proxy value
or identifier is created for the block state. The identifier
which is used in the preferred implementation is a state
identifier, or SI. The SI has a unique value for each block.
That value identifies the state of the block at some particular
time, such as either before or after the performance
of some operation upon the block.

The SI is much smaller than the complete state and 
can be inexpensively used in place of the complete state 
as long as the complete state can be recreated when 
necessary. An SI is "defined" by storing a particular val- 
ue, called the "defining state identifier" or DSI, in the 
block. The DSI denotes the state of the block in which 
it is included. 

State recreation can be accomplished by accessing 
the entire block stored in persistent storage during re- 
covery and noting the DSI of that block. This block state 
is then brought up to date, as explained in detail below, 
by applying logged actions as appropriate. 

A similar technique is described below for media re- 
covery using the ALOG. Knowing whether a log record 
applies to a block involves being able to determine, from 
the log record, to what state the logged action applies. 
In accordance with the present invention, a block's DSI 
is used to determine when to begin applying log records 
to that block. 

In a centralized or partitioned system, the physical 
sequencing of records on a single log is used to order 
the actions to be redone. That is, if action B on a block 
immediately follows action A on the block, then action 
B applies to the block state created by action A. So, if 
action A has been redone, the next log record to apply 
to the block will be action B. 

Single log systems, such as centralized or partitioned
systems, frequently use LSNs as SIs to identify
block states. The LSN that serves as the DSI for a block
identifies the last record in log sequence to have its effect
reflected in the block. In such systems, the LSN of
a log record can play the role of an "after state identifier"
or ASI, which identifies the state of the block after the
logged action. This is in contrast to a BSI (before state
identifier) which is used in the present invention in a log
record as described below.

In order to update the DSI and prepare for the next
operation, it is also necessary to be able to determine
the ASI for a block after applying the log record. It is
useful to be able to derive the ASI from the log record,
such as from the BSI, so the ASI need not be stored in
log records, although the ASI can indeed be stored. The
derivation must be one, however, that can be used during
recovery as well as during normal operation. Preferably,
the SIs are in a known sequence, such as the monotonically
increasing set of integers beginning with zero.
In this technique, the ASI is always one greater than
the BSI.

When storing the updated block back to persistent
storage, such as shared disk system 140, a Write-Ahead
Log (WAL) protocol is used. The WAL protocol
requires that the redo and undo buffers be written to the
logs in shared disk system 140 before the blocks. This
ensures that the information necessary to repeat or undo
the action is stably stored before changing the persistent
copy of the data.
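
The ordering imposed by the WAL protocol can be illustrated with a minimal sketch. The classes `SharedDisk` and `Node`, the string log records, and the `events` list (kept only so the write order can be inspected) are all hypothetical. For simplicity the sketch forces the entire redo and undo buffers before any block write, rather than only the records relating to the block being written.

```python
from typing import Dict, List


class SharedDisk:
    """Stand-in for persistent storage such as shared disk system 140."""

    def __init__(self) -> None:
        self.rlog: List[str] = []
        self.ulog: List[str] = []
        self.blocks: Dict[int, bytes] = {}
        self.events: List[str] = []  # order of stable writes, for inspection


class Node:
    """A node with a cache plus redo and undo buffers."""

    def __init__(self, disk: SharedDisk) -> None:
        self.disk = disk
        self.cache: Dict[int, bytes] = {}
        self.redo_buffer: List[str] = []
        self.undo_buffer: List[str] = []

    def update(self, bid: int, contents: bytes, redo: str, undo: str) -> None:
        """Change the cached block and buffer the matching log records."""
        self.cache[bid] = contents
        self.redo_buffer.append(redo)
        self.undo_buffer.append(undo)

    def force_logs(self) -> None:
        """Move buffered log records to the logs in persistent storage."""
        if self.redo_buffer:
            self.disk.rlog.extend(self.redo_buffer)
            self.redo_buffer.clear()
            self.disk.events.append("rlog")
        if self.undo_buffer:
            self.disk.ulog.extend(self.undo_buffer)
            self.undo_buffer.clear()
            self.disk.events.append("ulog")

    def write_block(self, bid: int) -> None:
        """WAL: the logs reach persistent storage before the block does."""
        self.force_logs()
        self.disk.blocks[bid] = self.cache[bid]
        self.disk.events.append("block")
```

In this sketch, a call to `write_block` after an `update` always records the log writes ahead of the block write, so a crash between the two leaves the log records available for redo or undo.
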

If the WAL protocol is not followed, and a block were
to be written to persistent storage prior to the log record
for the last update for the block, recovery could not occur
under certain conditions. For example, an update at one
node may cause a block containing uncommitted updates
to be written to persistent storage. If the last update
to that block has not been stored to the node's
RLOG, and another transaction on a second node further
updates the block and commits, the DSI for the
block will be incremented. At the moment of commit for
that second transaction, the logged actions for these
other transactions are forced to the RLOG for the second
node. Because the second update was generated
by a different node, however, the writing of the log
records for the second transaction does not assure that
the log record for the uncommitted transaction on the
original node is written.

If the original node crashes and the log record for
the uncommitted transaction is never written to the
RLOG, a gap is created in the ASI-BSI sequencing for
the block. Should the block in persistent storage ever
become unavailable, for example because of disk failure,
recovery would fail because the ALOG merge, as
explained below, requires a known and gapless sequence
of SIs.

Thus, the WAL protocol is a necessary condition for
an unbroken sequence of logged actions. It is also a sufficient
condition with respect to block updates. When a
block moves from one node's cache to another, the WAL
protocol forces the RLOG records for all prior updates
to the blocks to be changed by the committing transaction.
"Forcing" means ensuring that the records in a
node's cache or buffer are stably stored in persistent
storage.

By writing to persistent storage, the WAL protocol
forces the writing of all records in the original node's
RLOG up through the log record for the last update to
the current block.

D. New Block Allocation 

When a block has been freed, such as during nor- 
mal disk storage management routines, and is later re- 
allocated for further use, its DSI should not be set to 
zero because this activity results in non-unique state 
identifiers. If the DSI were set to zero, several log 
records might appear to apply to a block because they 
would have the same SI. Additional information would 
be needed to determine the correct log record. Thus, 
the DSI numbering used in the previous allocation must 
be preserved uninterrupted in the new allocation. Preferably,
the BSI for a newly allocated block is the ASI of
the block as it is freed.

One easy way to achieve uninterrupted SI number- 
ing is to store a DSI in the block as a result of the free 
operation. When the block is reallocated, it is read, per- 
haps from persistent storage, and the normal DSI incre- 
menting is continued. This treats allocation and freeing 
just like update operations. One problem with this solu- 
tion is the necessity of reading newly reallocated blocks 
before using them. To make space management efficient
with a minimum of I/O activity, however, it would
be desirable to avoid the "read before allocation" penalty.

The present invention gains efficiency by not writing 
the DSI for all unallocated blocks. For blocks not previ- 
ously allocated, the initial DSI is always set at zero. Only 
the DSI for blocks that have been deallocated is stored. 
These DSIs are stored using the records already kept 
by the system for bookkeeping of free space in persist- 
ent storage. Usually such bookkeeping information is re- 
corded in a collection of space management blocks. 

By storing the initial SI for each deallocated block
with this space management information, the initial SIs
do not need to be stored in the blocks, thus eliminating
the read before allocation penalty. On reallocation, the
BSI for the "allocate" operation becomes the initial SI of
this previous "free" block.
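
The bookkeeping described above can be sketched as follows. The class `SpaceManager` and its method names are hypothetical, and the dictionary stands in for the space management blocks; the sketch records an initial SI only for deallocated blocks and treats every other block as never before allocated, with an initial SI of zero.

```python
from typing import Dict


class SpaceManager:
    """Free-space bookkeeping that carries initial SIs for freed blocks."""

    def __init__(self) -> None:
        # Only deallocated blocks appear here; never-before-allocated
        # blocks are implicit and have an initial SI of zero.
        self._freed_si: Dict[int, int] = {}

    def free(self, bid: int, asi_of_free: int) -> None:
        """Record the ASI of the free operation as the block's initial SI."""
        self._freed_si[bid] = asi_of_free

    def allocate(self, bid: int) -> int:
        """Return the BSI for the allocate operation without reading the block."""
        return self._freed_si.pop(bid, 0)
```

For example, allocating a block that was never used yields a BSI of 0, while allocating a block that was freed with an ASI of 5 yields a BSI of 5, so the SI numbering continues uninterrupted across the free and the reallocation.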

Of course, to make this procedure operate correctly, 
blocks containing space management information must 
be periodically written to persistent storage, and one 
node must not be allowed to reallocate blocks freed by 
another node until the freed blocks' existence is made 
known to it via this bookkeeping. Thus, maintaining initial
SIs for freed blocks does not cause additional reading
or writing of the free space bookkeeping information.

Although adding SI information for free blocks does
increase the amount of space management information
needed in this system, there are two reasons why system
efficiency should not suffer too much. First, most of
the free space is characterized as "never before allocated,"
and thus already has an initial SI of zero. Second,
the previously used free space is small in most data bases
because data bases are usually growing. Because
the initial SIs are stored individually only for the reallocated
blocks, the increased storage for SIs should be
small.

Alternatively, the never-before-allocated blocks
could be distinguished from reallocated ones. The SI for
the reallocated blocks could then be read from persistent
storage when those blocks are allocated. This would
create a read before allocation penalty, however, although
the penalty would be light for the reasons discussed
above.



E. Recovery

1. Block Versions

To understand how the logs can be used in recovery,
it is necessary to understand the different versions
of blocks that may be available after a crash. These versions
may be characterized in terms of how many of the
logged updates on how many logs are needed to make
the available version current. This has obvious impact
with respect to how extensive or localized recovery activity
will be.

For purposes of recovery, there are three kinds of
blocks. A version of a block is "current" if all updates that
have been performed on the block are reflected in the
version. A block having a current version after a failure
needs no redo recovery. When dealing with unpredictable
system failures, however, one cannot ensure that
all blocks are current without always "writing-thru" the
cache to persistent storage whenever an update occurs.
This is expensive and is rarely done.

A version of a block is "one-log" if only one node's
log has updates that have not yet been applied to the
block. When a failure occurs, at most one node need be
involved in recovery. This is desirable because it avoids
potentially extensive coordination during recovery, as
well as additional implementation cost.

A version of a block is "N-log" if more than one
node's log can have updates that have not yet been applied
to it. Recovery is generally more difficult for N-log
blocks than one-log blocks, but it is impractical when
providing media recovery to ensure that blocks are always
one-log because this would involve writing a block
to archive storage every time the block changes nodes.

2. Redo Recovery

Without care, some blocks will be N-log at the time
of a system crash (as opposed to a media failure). The
preferred implementation of this invention, however,
guarantees that all blocks will be one-log blocks for system
crash recovery. This is advantageous because N-log
blocks can require complex coordination between
nodes for their recovery. Although such coordination is
possible since the updates were originally sequenced
during normal system operation using distributed concurrency
control, such concurrency control requires
overhead which should be avoided during recovery.

All blocks can be guaranteed to be one-log with re- 
spect to redo recovery by requiring "dirty" blocks to be 
written to persistent storage before they are moved from 
one cache to another. A dirty block is one whose version 
in the cache has been updated since the block was read 
from persistent storage. 

If this rule is followed, a requesting node always 
gets a clean block when the block enters the new node's 
cache. Furthermore, during recovery, only the records 
on the log of the last node to change the block need be 
applied to the block. All other actions of other nodes 
have already been captured in the state of the block in 
persistent storage. Thus, all blocks will be one-log for 
redo recovery, so redo recovery will not require distrib- 
uted concurrency control. 
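
The rule that dirty blocks are written to persistent storage before moving between caches can be sketched as follows. The `Cache` class, the `transfer` function, and the dictionary standing in for persistent storage are hypothetical, and the forcing of log records required by the WAL protocol before the block write is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Cache:
    """A node's cache: block contents plus the set of dirty block ids."""
    blocks: Dict[int, bytes] = field(default_factory=dict)
    dirty: Set[int] = field(default_factory=set)


def transfer(bid: int, src: Cache, dst: Cache, disk: Dict[int, bytes]) -> None:
    """Move a block between caches, writing it to disk first if it is dirty."""
    if bid in src.dirty:
        disk[bid] = src.blocks[bid]  # persistent version now reflects src's updates
        src.dirty.discard(bid)
    dst.blocks[bid] = src.blocks.pop(bid)  # the requesting node receives a clean block
```

After `transfer` returns, the persistent version of the block already contains every update made by the sending node, so only the receiving node's later updates can be missing from it.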

Following this technique does not mean that multi- 
ple logs will never contain records for a block. This tech- 
nique merely ensures that only one node's records are 
applicable to the version of the block in persistent stor- 
age. 

Furthermore, although one-log redo recovery is be- 
ing assumed for system crashes, in order to perform me- 
dia recovery, redo actions on multiple logs may have to 
be applied to avoid writing each block to archive storage 
every time the block moves between caches. Hence, it 
is still necessary in certain circumstances to order the 
logged actions across all the logs to provide recovery 
for N-log blocks. This can be accomplished, however,
because of the sequential SIs.

Figure 6 shows a flow diagram 600 of the basic
steps for a redo operation using the RLOG and the SIs
described above. The redo operation represented by
flow diagram 600 would be performed by a single node 
using a single RLOG record applied to a single block. 

First, the most recent version of the block identified 
by the log record would be retrieved from the persistent 
storage (step 610). If the DSI stored in that retrieved 
block is equal to the BSI stored in the log record (step 
620), then the action indicated in the log record is ap- 
plied to the block and the DSI is incremented to reflect 
the new state of the block (step 630). Otherwise, that 
update is not applied to the block. 
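
The steps of flow diagram 600 can be sketched as follows. The `Block` and `LogRecord` classes, the list used to model block contents, and the function name `redo` are assumptions made for illustration; the sketch also assumes the preferred SI sequence in which the ASI is one greater than the BSI.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Block:
    dsi: int                                       # defining state identifier
    data: List[str] = field(default_factory=list)  # modeled block contents


@dataclass(frozen=True)
class LogRecord:
    bid: int        # block the logged action applies to
    bsi: int        # state of the block before the action
    redo_data: str  # information needed to repeat the action


def redo(record: LogRecord, storage: Dict[int, Block]) -> bool:
    """Apply one log record to one block; return True if it was applied."""
    block = storage[record.bid]           # step 610: retrieve the block
    if block.dsi != record.bsi:           # step 620: equality test only
        return False                      # the update is not applied
    block.data.append(record.redo_data)   # step 630: apply the action
    block.dsi += 1                        # ASI = BSI + 1 becomes the new DSI
    return True
```

Because the test is one of equality, applying the same record a second time has no effect: after the first application the DSI no longer matches the record's BSI.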

The redo operation described with regard to Figure 
6 is possible because the BSIs and ASIs can be deter- 
mined at the time of recovery. Thus one can determine 
for each log which log records need to be redone, and 
this determination can be independent of the contents 
of other logs. The only comparison that needs to be 
made between block DSIs and log record BSIs is one 
of equality. 

The redo operation described with regard to Figure
6 can be used in recovering from system crashes. An
example of a procedure of crash recovery is shown by
the flow diagram 700 in Figure 7. A single node can execute
this crash recovery procedure independently of
other nodes.

The first step would be for the node to read the first
RLOG record indicated by the most recent checkpoint
(step 710). The checkpoint, as described below, indicates
the point in the RLOG which contains the record
corresponding to the oldest update that needs to be applied.

The redo operation shown in Figure 6 is then performed
to see whether to apply the action specified in
that log record to the block identified in that log record
(step 720).

If, after performing the redo operation, there are no
more records (step 730), then crash recovery is complete.
Otherwise, the next record is retrieved from the
RLOG (step 740), and the redo operation (step 720)
shown in Figure 6 is repeated.
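
The loop of flow diagram 700 can be sketched as a scan of one node's RLOG starting at the checkpoint. The names `Block`, `LogRecord`, `redo`, and `crash_recovery` are hypothetical, the RLOG is modeled as a list whose indices serve as LSNs, and the checkpoint is modeled as the LSN of the oldest record that may still need to be applied.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Block:
    dsi: int
    data: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class LogRecord:
    bid: int
    bsi: int
    redo_data: str


def redo(record: LogRecord, storage: Dict[int, Block]) -> bool:
    """Redo operation of Figure 6: apply only when DSI equals BSI."""
    block = storage[record.bid]
    if block.dsi != record.bsi:
        return False
    block.data.append(record.redo_data)
    block.dsi += 1
    return True


def crash_recovery(rlog: List[LogRecord], checkpoint_lsn: int,
                   storage: Dict[int, Block]) -> int:
    """Scan the RLOG from the checkpoint and return the number of redone actions."""
    applied = 0
    for record in rlog[checkpoint_lsn:]:  # steps 710 and 740: first, then next record
        if redo(record, storage):         # step 720: redo operation
            applied += 1
    return applied                        # step 730: no more records
```

The scan consults only this node's log and the persistent versions of the blocks, which reflects the statement that a single node can run the procedure independently of other nodes.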

If the SI associated with a log record is a monotonically
increasing ASI, the test of whether a log record
applies to a block in some state is whether this ASI is
the first one greater than the block's DSI. This is sufficient
only for one-log recovery, however, because in that
case only one log will have records with ASIs that are
greater than the DSI in the block.

In the preferred implementation of this invention,
however, each log record includes the precise identity
of the block state before a logged action is performed.
As explained above, this is the "before state identifier"
or BSI.

3. Multiple log redo for media recovery

Media recovery has many of the same characteristics
as crash recovery. For example, there needs to be
a stably stored version against which log records are
applied.

There are also important differences. First, the stable
version of the block against which the ALOG records
are applied is the version last posted to archive storage.

Media recovery is N-log because it involves restoring
blocks from archive storage and, as explained
above, blocks are not written to archive storage every
time they move between caches. Thus the technique of
writing blocks to storage to avoid N-log recovery for
system crashes cannot be used for media recovery.

Managing media recovery is difficult without merging
the ALOGs. If the ALOGs are not merged, then the
recovery involves constant searching for applicable log
records. In merging ALOGs, there is a substantial
advantage in using BSIs.

Figure 8 shows a procedure 800 for N-log media
recovery involving the merger of the multiple ALOGs.
The merging is not based on a total ordering among all
log records, but on the partial ordering that results from
the ordering among log records for the same block. At
times there will be multiple ALOGs that have records
whose actions can be applied to their respective blocks.
As will be apparent from the description of procedure
800, it is immaterial which of these actions is applied
first during media recovery.

It is faster and more efficient to permit the multiple
ALOGs to be merged and applied to the backup data
base in archive storage in a single pass. This can be
done if the SIs are ordered properly. That is why, as
explained above, the SIs are ordered in a known
sequence, and the preferred implementation of this
invention uses SIs that are monotonically increasing.

Beginning with any ALOG, the first log record is ac- 
cessed (step 810). The Block ID and BSI are then ex- 
tracted from that record (step 820). Next, the block iden- 
tified by the Block ID is fetched (step 830). 

Once the identified block is fetched, its DSI is read 
and compared to the ALOG record's BSI (step 840). If 
the ALOG's BSI is less than the block's DSI, the record 
is ignored because the logged action is already incor- 
porated into the block, and redo is not needed. 

If the ALOG record's BSI is equal to the block's DSI, 
then the logged action is redone by applying that action 
to the block (step 850). This is because the equality of 
the SIs means that the logged action applies to the
current version of the block.

The block's DSI is then incremented (step 860). 
This reflects the fact that the application of the logged 
action has created a new (later) version of the block. 

If the ALOG record's BSI is greater than the block's 
DSI, then it is not the proper time to apply the actions 
corresponding to the log record, and it is instead the 
proper time to apply the actions recorded on other 
ALOGs. Thus, the reading of this ALOG must pause and 
the reading of another ALOG is started (step 870). 

If the other ALOG had been paused previously (step 
880), then control is transferred to step 820 to extract 
the Block ID and the BSI of the log record which was 
current when that log was paused. If the log had not pre- 
viously been paused, then control proceeds as if this 
were the first ALOG. 

After all these steps, or if the other ALOG had never 
been previously paused, a determination is made 
whether any ALOG records remain (step 890). If so, the 
next record is fetched (step 810). Otherwise, the proce- 
dure 800 is ended. 

When an ALOG is paused, there must be at least 
one other ALOG that contains records for the block that 
precede the current one. A paused ALOG with a waiting 
log record is simply regarded as an input stream whose 
first item (in an ordered sequence) compares later than 
the items in the other input streams (i.e., the other 
ALOGs). Processing continues using the other ALOGs. 

The current record of the paused ALOG must be 
able to be applied to the block at some future time be- 
cause the BSI would not be greater than the block's DSI 
without intervening actions on other ALOGs. When this 
occurs, the paused ALOG will be unpaused. 

Not all of the ALOGs will be simultaneously paused 
because the actions were originally done in an order that 
agrees with the SI ordering for the blocks. Thus a merge 
of the ALOGs is always possible. 
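
As a non-authoritative sketch of procedure 800, the merge might be expressed as follows. The name `merge_alogs`, the tuple layout of a log record, the round-robin choice of the next ALOG, and the guard that raises an error if every remaining ALOG is paused are all assumptions introduced for the illustration.

```python
def merge_alogs(archive, alogs):
    """Merge several ALOGs against blocks restored from archive storage.

    archive -- dict: block id -> {"dsi": int, "value": ...}
    alogs   -- list of ALOGs; each is a list of (block_id, bsi, action)
    Returns the (alog index, record index) pairs in the order applied.
    """
    pos = [0] * len(alogs)        # current record of each ALOG
    applied = []
    current = 0                   # ALOG being read
    stalled = 0                   # consecutive ALOGs that made no progress
    while any(p < len(log) for p, log in zip(pos, alogs)):
        if stalled == len(alogs):
            # Cannot occur for logs written in SI order (assumed guard).
            raise RuntimeError("all remaining ALOGs are paused")
        log = alogs[current]
        progressed = False
        while pos[current] < len(log):
            block_id, bsi, action = log[pos[current]]    # step 820
            block = archive[block_id]                    # step 830
            if bsi > block["dsi"]:                       # step 840
                break             # pause this ALOG (step 870)
            if bsi == block["dsi"]:
                block["value"] = action(block["value"])  # step 850
                block["dsi"] += 1                        # step 860
                applied.append((current, pos[current]))
            # bsi < dsi: action already incorporated, record ignored
            pos[current] += 1
            progressed = True
        stalled = 0 if progressed else stalled + 1
        current = (current + 1) % len(alogs)   # read another ALOG
    return applied
```

A paused ALOG keeps its position, so when it is visited again the record that was current at the time of the pause is examined first, as described for step 880.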



4. Redo Management 

a. Safe Point Determination 

Many checkpointing techniques may be used with
the present invention to make redo recovery even more
efficient. For example, a Dirty Blocks table can be
created to associate recovery management information
with each dirty block. This information provides two
important functions in managing the RLOG, and therefore
the ALOG. First, the recovery management information
is used in determining a "safe point" that governs RLOG
scanning and truncation. Second, the information can
be used in enforcing the WAL protocol for the RLOG as
well as for potential undo logs.

Safe point determination is important to determine 
how much of the RLOG needs to be scanned in order 
to perform redo recovery. The starting point in the RLOG 
for this redo scan is called the "safe point." The safe 
point is "safe" in two senses. First, redo recovery can 
safely ignore records that precede the safe point since 
those records are all already included in the versions of 
blocks in persistent storage. Second, the "ignored" 
records can be truncated from the RLOG because they 
are no longer needed. 

This second feature is not true for combined undo/ 
redo logs. For example, if there were a long transaction 
which generated undo records, truncation may not be 
possible before the check point because the actions in 
the undo records may precede the actions in the redo 
records which have been written to the persistent stor- 
age. This would interfere with truncation. 

Dirty Blocks table 900 is shown in Figure 9. Prefer- 
ably, the current copy of Dirty Blocks table 900 is main- 
tained in volatile storage and is periodically stored into 
persistent storage in the RLOG as part of the check- 
pointing process. Dirty Blocks table entries 910, 911, 
and 912 include a recovery LSN field 920 and a Block 
ID field 930. 

The values in the recovery LSN field 920 identify 
the earliest RLOG record whose action is not included 
in the version of the block in persistent storage. Thus 
the value of LSN field 920 is the first RLOG record that 
would need to be redone. 

The value in Block ID field 930 identifies the block 
corresponding to the recovery LSN. Thus, Dirty Blocks 
table 900 associates with every dirty block the LSN of 
the RLOG record that made the block dirty. 

Another entry in Dirty Blocks table 900 is the
LastLSN entry 950. The value for this entry is, for each block,
the LSNs of the RLOG and ULOG records that describe 
the last update to the block. LSNs are used rather than 
DSIs because it is necessary to determine locations in 
logs. 

LastLSN 950 includes RLastLSN 955 (for the 
RLOG) and a list of ULastLSNs 958 (one for each of the 
ULOGs) which indicate, respectively, how much of the 
RLOG and ULOGs need to be forced when the block is
written to persistent storage in order to enforce the WAL
protocol. Enforcing the WAL protocol thus means that 
all actions incorporated into a block on persistent stor- 
age have both RLOG and ULOG records stably stored. 

RLastLSN 955 and ULastLSN 958 are not included 
in the checkpoint (described below) because their role 
is solely to enforce the WAL protocol for the RLOG and 
ULOG. Hence, in the preferred implementation, these 
entries are kept separate from the recovery LSN to avoid 
storing them with the checkpoint information. 

The earliest LSN for all blocks in a node's cache is 
the safe point for the redo scan in the local RLOG. Redo 
recovery is started by reading the local RLOG from the 
safe point forward and redoing the actions in subse- 
quent records. All blocks needing redo have all actions 
needing to be redone encountered during this scan. 
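
For illustration, the safe point computation might be sketched as below. The function names and the representation of the Dirty Blocks table as a mapping from Block ID to recovery LSN are hypothetical; returning the end-of-log LSN when no block is dirty is likewise an assumption, since the specification does not address the empty case.

```python
def safe_point(dirty_blocks, end_of_log_lsn):
    """Return the RLOG LSN at which the redo scan must start.

    dirty_blocks   -- dict: block id -> recovery LSN (fields 930 and 920)
    end_of_log_lsn -- LSN just past the last RLOG record (assumed fallback)
    """
    if not dirty_blocks:
        return end_of_log_lsn     # nothing is dirty, nothing to redo
    return min(dirty_blocks.values())


def block_written(dirty_blocks, block_id):
    """Drop a block from the table once it reaches persistent storage."""
    dirty_blocks.pop(block_id, None)
```

Writing out the block that holds the earliest recovery LSN removes that entry, so the minimum taken over the remaining entries can only stay the same or move toward the tail of the log.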

As explained above, the one-log assumption makes 
it possible to manage each RLOG in isolation. A node 
need only deal with its own RLOG, thus one node's ac- 
tions will never be the reason for a block being dirty in 
some other node's cache. Hence, it is sufficient to keep 
a simple recovery LSN (one that does not name the 
RLOG) associated with each block, where it is under- 
stood that the recovery LSN identifies a record in the 
local RLOG. 

b. Checkpointing 

The purpose of checkpointing is to ensure that the 
determination of the safe point, as described above, can 
survive system crashes. Checkpointing can be com- 
bined with a strategy for managing blocks that permits 
the safe point to move and shrink the part of the log needed
for redo. There are many different techniques for
checkpointing. One is described below, but should not be
considered to be a required technique.

The preferred technique for implementing this in- 
vention is a form of "fuzzy" RLOG checkpointing. It is 
called "fuzzy" because the checkpointing can be per- 
formed without concern for whether a transaction or an 
operation is completed. 

Recovery of a version of the Dirty Blocks table 900 
from the checkpointed information permits a determina- 
tion of where to begin the redo scan. Only blocks in the 
Dirty Blocks table 900 need to be redone because only 
those blocks have actions which have not been stored 
into the persistent storage. As explained above, the 
Dirty Blocks table 900 indicates the earliest logged 
transaction that might need redoing. 

System crash recovery via the RLOG and media re- 
covery via the ALOG will typically have different safe 
points and will be truncated accordingly. In particular, a 
truncated portion of an RLOG may continue to be re- 
quired for media recovery. If so, the truncated portion 
becomes part of the ALOG. 

ALOG truncation uses RLOG checkpoints. An 
RLOG checkpoint determines a safe point which permits
the truncation of the RLOG as of the time of the
checkpoint. This is because all versions of the data in
persistent storage are more recent than this safe point,
or else the point would not be safe.

To truncate an ALOG, blocks on persistent storage
are first backed up to archive storage. When this is
complete, an archive checkpoint record is written to an
agreed upon location, e.g., in archive storage, to identify
the RLOG checkpoints that were current when
determination of the archive checkpoint began.

An ALOG can be truncated at the safe point identified
by the RLOG checkpoint named in the archive
checkpoint for media recovery. All persistent storage
blocks are written to archive storage after that RLOG
checkpoint was done, and hence reflect all the changes
made prior to this checkpoint's safe point. During block
backup, several additional RLOG checkpoints may be
taken. These do not affect ALOG truncation because
there is no guarantee that the log records involved have
all been incorporated into the states of blocks in archive
storage. Actions that do not need to be redone but that
are left on an ALOG are detected as not applicable and
are ignored during the media recovery process.

Checkpoints are written to the RLOG. To find the 
last checkpoint written to the RLOG, its location is writ- 
ten to the corresponding node's persistent storage in an 
area of global information for the node. The most recent 
checkpoint information is typically the first information 
accessed during recovery. Alternatively, one can search 
the tail of the RLOG for the last checkpoint. 

Checkpoints provide a major advantage of a pure 
RLOG which is that the system has explicit control over 
the size of the redo log and hence the time required for 
redo recovery. If the RLOG were combined with the 
ULOG, a safe point could not be used for log truncation 
for the reason explained above. 

In addition, eliminating undo information from the 
RLOG allows the system to control log truncation by 
writing blocks to persistent storage. RLOG truncation 
never requires the abort of long transactions. This is not 
true when truncating logs containing undo information. 

The system exercises control over the RLOG by 
writing blocks back to their locations in persistent stor- 
age. In fact, this writing of blocks is sometimes consid- 
ered part of the checkpoint. Blocks may also be written 
to persistent storage that have recovery LSNs that are 
older, i.e., further back in the RLOG. This moves the safe
point for the RLOG closer to the tail of the log. Log 
records whose operations are included in the newly- 
written block are no longer needed for redo recovery, 
and hence can be truncated. 

Media recovery follows the same basic paradigm 
as system crash recovery. Versions of blocks are re- 
corded stably in the archive storage. As explained 
above, each ALOG is formed from the truncated part of 
one of the RLOGs. The ALOG itself can be truncated 
periodically, based on what versions of blocks are in the 
archive storage. 

With only a DSI stored in a block and not an LSN,
it is not possible to know which log was last responsible
for updating the archive storage block, nor where this 
record is in the RLOG. Thus, the information in the 
blocks is insufficient to determine the proper point to 
truncate the ALOGs or RLOGs. The Dirty Blocks table, 
however, can be used as a guide in truncating the 
RLOG. And an RLOG safe point can be used to estab- 
lish an ALOG safe point. 

F. ULOG Operations 

1. ULOG Management 

In addition to the advantages that separating 
RLOGs from ULOGs has on RLOG operation, there are 
also advantages that such separation has on ULOG op- 
eration. For example, a transaction-specific ULOG can 
be discarded once a transaction commits. Hence, space 
management for ULOGs is simple and undo information 
does not remain for long in persistent storage. 

In addition, as explained below, durably writing un- 
do records to the log can frequently be avoided. An undo 
record need only be written when a block containing un- 
committed data is written to persistent storage. 

One disadvantage of keeping the ULOGs separate from
the RLOG is that two logs must be forced when a
block with uncommitted data is written to persistent
storage in order to satisfy the WAL protocol. In general,
however, writing blocks with uncommitted data to persistent
storage should be sufficiently infrequent that the
separation of logs provides a net gain, even in performance.

For N-log undo, multiple nodes can have uncom- 
mitted data in a block simultaneously. A system crash 
would require these transactions to all be undone, which 
may require, for example, locking during undo recovery 
to coordinate block accesses. 

To ensure that all blocks will be one-log with respect 
to undo recovery, no block containing uncommitted data 
from one node is ever permitted to be updated by a sec- 
ond node. This can be achieved through a lock granu- 
larity that is no smaller than a block. A requesting node 
will then receive a block in which no undo processing by 
another node is ever required. Therefore, for example, 
if a transaction from another node had updated a block 
and then aborts, the effect of that transaction has al- 
ready been undone. 

Although one-log undo reduces complexity, the im- 
pact on system performance of N-log undo at recovery 
time is much less than for N-log redo. This is because 
only the small set of transactions that were uncommitted 
at the time of system crash needs undoing. And having 
lock granularity no smaller than a block may substan- 
tially decrease concurrency. 

The technique of the present invention will usually
avoid the need to write to the ULOG for a short
transaction. This is because it will be rare that a cache slot
containing a block with uncommitted data from any
particular short transaction will be needed. The reason for
such rarity is that most short transactions should
commit or abort prior to their cache slots being needed.

Should a cache slot to be stolen contain a block with
uncommitted data, the WAL protocol requires the writing
of undo records to all appropriate ULOGs. The WAL
protocol is enforced for the ULOG by force-writing each
ULOG through the records identified by ULastLSN in the
Dirty Blocks table entry for the block. As explained
above, ULastLSNs identify the undo records for the last
update to the block in each ULOG.

With the WAL protocol, the information needed to
store the states of blocks without updates of a
transaction is always durably stored in a transaction's ULOG
prior to overwriting the persistent storage version of the
block with the new state. Hence, the state of blocks
without the updates of a transaction is always durable prior
to transaction commit. This information is either: (i) in
the block version in persistent storage, (ii) "redo
recoverable" from the version in persistent storage using the
RLOG information from preceding transactions, or (iii)
undo recoverable from a version produced by (i) or (ii)
using the undo information which is either logged on the
ULOG by the WAL protocol for this transaction, or
created during redo recovery.

For the blocks on persistent storage that are still in
a prior state, it is possible to have RLOG records without
corresponding ULOG records. This is common where
there is "optional" undo logging. It is also possible to
have ULOG records without corresponding RLOG
records for such blocks. In this case, the ULOG records
can be ignored.

Thus, not all actions needing to be undone after redo
recovery will necessarily be found in the ULOG. Should the
system crash, the missing undo records need to be
generated from the redo records and blocks' prior states.
As long as an action depends only on the block state
and value parameters of the logged action, the generation
of undo records will be possible because all the
information available when the action was originally
performed is available at this point.

Actions end up on the ULOG for two reasons: either
the WAL protocol forces a buffer record to the ULOG
because the block was written to persistent storage, or
the writing of the ULOG for WAL enforcement results in
the writing of preceding ULOG records and, in some
cases, following ULOG records that are in the undo
buffer.

For these actions, it is not necessary to generate
undo records during recovery because these records
are guaranteed to be on a ULOG. This is important
because it might not be possible to construct the ULOG
record for the redo-logged transaction because the
version of the block in persistent storage has a state that
comes after the action. Fortunately, it is exactly these
blocks for which ULOG records already exist.

During redo, the missing undo records would be 
generated. By the end of redo, the union of generated 
undo records and undo records on the ULOGs would be
capable of rolling back all uncommitted transactions.

2. ULOG Optimization 

With the present invention, the use of the ULOG can
be optimized by making sure that the contents of an
undo log buffer are written to a ULOG only when
necessary. In general, the undo buffer need only be stored to
a ULOG when a block containing uncommitted data 
from a current transaction is written to persistent stor- 
age. If the transaction has been committed, there will be 
no need to undo the updates in the transaction, and thus 
the undo buffer can be discarded. 

Figure 10 shows a flow diagram 1000 of a proce- 
dure for implementing this ULOG optimization using the 
WAL protocol. It assumes that a version of the block is 
to be written to the persistent storage. 

If the block to be written contains uncommitted data 
(step 1010), then the redo buffer needs to be written to 
the RLOG in the persistent storage, and any undo buff- 
ers are written to ULOGs in the persistent storage (step 
1020). 

After writing the redo buffers to the RLOG and the 
undo buffers to the ULOGs (step 1020), or if the blocks 
did not contain uncommitted data (step 1010), the block
is written to the persistent storage (step 1030). This is 
in accordance with the WAL protocol. 
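
The procedure of Figure 10 might be sketched as follows, by way of example only. The `Log` class, its `append` and `force` methods, and the `write_block` function are hypothetical. As a simplification, the sketch forces each log buffer in its entirety rather than only through the LSNs recorded in the LastLSN entry.

```python
class Log:
    """Hypothetical log with a volatile buffer and a stable portion."""

    def __init__(self):
        self.buffer = []          # records not yet on persistent storage
        self.stable = []          # records durably stored

    def append(self, record):
        self.buffer.append(record)

    def force(self):
        """Move every buffered record to stable storage."""
        self.stable.extend(self.buffer)
        self.buffer.clear()


def write_block(block_id, cache, disk, rlog, ulogs, has_uncommitted):
    """Write a cached block to persistent storage under the WAL protocol."""
    if has_uncommitted:           # step 1010
        rlog.force()              # step 1020: redo buffer to the RLOG
        for ulog in ulogs:
            ulog.force()          # step 1020: undo buffers to the ULOGs
    disk[block_id] = cache[block_id]   # step 1030
```

When the block holds only committed data, the undo buffers are left untouched, which is what allows them to be discarded without ever reaching persistent storage.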

Thus, the undo buffers are only written if there is 
uncommitted data to be stored. Each time a transaction 
commits, the corresponding undo log buffer can be dis- 
carded since it need not ever be written to the persistent 
storage. Furthermore, the ULOG itself for the transac- 
tion may be discarded as undo is now never required. 

A committed transaction is made durable by the re- 
cording of all the redo records for the transaction in the 
RLOG in persistent storage. The updated block can be 
written to the persistent storage at some later time. Even 
if there were a crash before the updated block were
written, the RLOG could be retrieved to restore the state of
the block. The system knows that a transaction is committed
because a commit record is stored in the RLOG.

3. Transaction Aborts 

A ULOG can thus be discarded when a transaction 
commits, as undoing the effects of a transaction is no 
longer required. For transaction abort, the situation is 
somewhat different. Before the ULOG records for a 
transaction can be discarded, it is necessary to ensure 
that all blocks changed by an aborting transaction not 
only have their changes undone, but also that the result- 
ing undone block states are durably stored somewhere 
other than in a ULOG. Either the blocks themselves in 
their undone state must be written to persistent storage 
(called a "FORCE" abort), or the undo transactions must 
be written and forced to the RLOG (called a "NO-FORCE"
abort). Similar to committing transactions, logging
actions on the RLOG obviates the need to force
blocks to persistent storage in this case.

a. NO-FORCE Abort

A NO-FORCE abort can be realized by treating the 
undo operations as additional actions of the aborting 
transaction which reverse the effect of the previous up- 
dates. Such "compensating" actions are logged on the 
RLOG as "compensation log records" (CLRs). 

Compensation log records are effectively undo 
records moved to the RLOG. Extra information is re- 
quired, however, to distinguish these records from other 
RLOG records. In addition, an SI is needed to sequence 
the CLR correctly with respect to other logged transac- 
tions to be redone. 

Figure 11 shows a CLR 1100 with several attributes. 
TYPE attribute 1110 identifies this log record as a com- 
pensation log record. 

TID attribute 1120 is a unique identifier for the trans- 
action. It helps in finding the ULOG record correspond- 
ing to this RLOG CLR. 

BSI attribute 1130 is the before state identifier, as 
described above. In this context, BSI attribute 1130 
identifies the block state at the time that the CLR is ap- 
plied. 

BID attribute 1140 identifies the block modified by 
the action logged with this record. 

UNDO_DATA attribute 1150 describes the nature of
the action to be undone and provides enough informa- 
tion for the action to be undone after its associated orig- 
inal action has been incorporated into the block state. 
The value for the UNDO_DATA attribute 1150 comes 
from the corresponding undo record stored either in a 
ULOG or in an undo buffer. 

RLSN attribute 1160 is the RLOG record which de- 
scribes the same action for which this action is the undo. 
This attribute comes from the RLSN attribute 440 of the 
ULOG record. 

LSN 1170, which need not be stored explicitly be- 
cause it may be identified by its location in the RLOG, 
identifies this CLR uniquely on the RLOG. The LSN is 
used to control the redo scan and checkpointing of the 
RLOG. 

As with transaction commit, when a transaction 
aborts, all redo records describing the actions of the 
transaction should be written to the RLOG. For the 
aborted transaction, this includes the undo actions in the 
CLRs. For a commit, the RLOG is forced to ensure that 
all redo records for the transaction are stably stored. For 
abort, this is not strictly necessary. The needed informa- 
tion still exists on the ULOG. However, the ULOG cannot 
be discarded until CLRs for the aborting transaction 
have been durably written to the RLOG. The CLRs on 
the RLOG will then substitute for the ULOG records. 

A desirable property of the NO-FORCE approach
is that for media recovery, only the redo phase is
needed. Updates are applied in the order that they are
processed during the ALOG merging. No separate undo
phase is required while processing the ALOG because
any needed undo is accomplished by applying CLRs.

A second table, called the Active Transactions ta- 
ble, records the information needed to effect undo op- 
erations. Like the Dirty Blocks table 900, the Active 
Transactions table becomes part of the checkpoint in- 
formation on the RLOG so that its information is pre- 
served if the system crashes. 

The Active Transactions table indicates transac- 
tions that may need to be undone, the state of the undo/ 
redo logging, and the undo progress. Enough informa- 
tion must be encoded in the Active Transactions table 
to ensure recovery from all system crashes, including 
those that occur during recovery itself. Some informa- 
tion which improves recovery performance may also be 
included. 

Figure 12 shows an example of an Active Transac- 
tions table 1200. Table 1200 includes records 1201, 
1202, and 1207. Each of the records includes several 
attributes. 

TID attribute 1210 is a unique identifier for the
transaction. It is the same as the transaction identifier used
for RLOG records.

STATE attribute 1220 indicates whether an active 
transaction is "prepared" as part of a two-phase commit. 
A two-phase commit is used when multiple nodes take 
part in a transaction. To commit such a transaction, all 
the nodes must first prepare the transaction (phase 1) 
before they can commit it (phase 2). The preparation is 
done to avoid partial commits which would occur if one 
node commits, but another aborts. A prepared transac- 
tion needs to be retained in the Active Transactions ta- 
ble 1200 because it may need to be rolled back. Unlike 
a non-prepared transaction, a prepared transaction 
should not be automatically aborted. 

ULOGloc attribute 1230 indicates the location of the
transaction-specific ULOG. This attribute need only be 
present should there be no other way to find the ULOG. 
For example, the TID 1210 might provide a substitute 
way of finding the ULOG for the transaction. 

HIGH attribute 1240 indicates the RLOG LSN of the
action which is the last action with an undo record written 
to the ULOG for this transaction. This ULOG record con- 
tains an RLOG LSN in RLSN such that RLOG records 
that follow RLSN need to be generated during redo after 
a system crash in order to be ready to roll back the trans- 
action should it not have been committed. 

NEXT attribute 1250 indicates the RLOG LSN of the
next action in the transaction that needs to be undone. 
For transactions that are not being rolled back, NEXT 
attribute 1250 is the record number for the last action 
performed by the transaction. 

Although some systems undo CLRs during recov- 
ery, they are not undone in the preferred embodiment. 
Instead, CLRs are tagged [via the TYPE attribute] so 
they can be identified during recovery. 

Because of the sequential nature of the ULOG,
when an undo record is forced to a ULOG, all preceding
undo records are also guaranteed to be durable. RLOG
records are written in the same order as the ULOG
records. Hence, if an RLOG record is found that does
not need redo, for example because its effect is already
in the version of the block on persistent storage, then all
preceding RLOG records have undo records in the
ULOG. This occurred because the ULOG was forced
when the block was written, hence all prior ULOG
records were written at the same time. If undo records
have been generated during redo for this transaction,
they can be discarded as all such prior records must
already exist in the ULOG.

The RLOG LSN of the last RLOG record for which
a ULOG record was written is stored in HIGH attribute
1240 (Figure 12) of the Active Transactions table entry
for the transaction. RLOG records that precede this
indicated redo log record do not generate undo records
during redo because they all have ULOG records
already. RLOG records following the one denoted by
HIGH may need to have undo records generated.

Undo record generation can also be avoided if the
number of undo records that has already been applied
for each transaction is carefully monitored. Hence, the
undo "high water mark" is encoded in the NEXT attribute
1250 of Active Transactions table 1200. The NEXT
attribute 1250 contains the record number of the next
undo record to be applied for the transaction.

During normal processing, the NEXT attribute 1250
is always the record number for a transaction's most
recent action. The value in the NEXT attribute 1250
is incremented as these actions are logged. During undo
recovery, the value in the NEXT attribute 1250 is
decremented after every undo action is applied and its CLR
is logged, naming its predecessor undo record as the
next undo action. Should a system crash occur during
rollback, undo records with record numbers higher than
that indicated by the NEXT attribute 1250 need not be
re-applied, and hence need not be generated again
during redo.

The end result is that during redo, undo records are
generated for the RLOG records whose record numbers
fall in between the values for HIGH attribute 1240 and
NEXT attribute 1250. Whenever the value of HIGH
attribute 1240 is greater than or equal to the value of the
NEXT attribute 1250, no undo records need be
generated at all.
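
The HIGH/NEXT test might be sketched as below. The function name is hypothetical, and the treatment of the boundaries is an assumption inferred from the surrounding description: the record denoted by HIGH already has a ULOG record and is excluded, while the record denoted by NEXT is still to be undone and is included.

```python
def records_needing_generated_undo(record_numbers, high, next_):
    """Select the RLOG records for which undo records are generated
    during redo.

    record_numbers -- record numbers of one transaction's RLOG records
    high           -- HIGH attribute 1240: last record with a ULOG record
    next_          -- NEXT attribute 1250: next record to be undone
    """
    # Assumed boundaries: exclusive of HIGH, inclusive of NEXT.
    return [n for n in record_numbers if high < n <= next_]
```

With these boundaries the selection is empty whenever HIGH is greater than or equal to NEXT, which agrees with the statement that no undo records need be generated in that case.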

b. Force Abort 

With the "FORCE" abort, CLRs are not written.
Instead, when blocks are undone, the blocks themselves
are forced to persistent storage. In this type of abort, the
need is to stably retain knowledge that a block includes
the result of applying an undo record, as well as the
sequence in which the undo operations were performed,
without writing a CLR for it.

The goal is to support N-log undo where several
nodes may undo transactions on a single block as a
result of a system crash. Hence, the progress of undo
operations performed by each node must be stably
recorded. This is what CLRs accomplish in the NO-FORCE
case. Without CLRs, some other technique is required.

One alternative is to write the needed information
into the block going to persistent storage. Although a
CLR contains a complete description of the undo action,
not all of this description is needed. What is needed in
the FORCE abort case is to record the results of the
undo transactions and which of them have been undone.

G. Normal Operations 

During normal operation, Transaction Start, Block
Update, Block Write, Transaction Abort, Transaction
Prepare, and Transaction Commit operations have an
impact on recovery needs. Hence, during normal
operation, steps must be taken with respect to logging to
assure that recovery is possible.

Figure 13 contains a procedure 1300 for a
Transaction Start operation. First, a
START_TRANSACTION record must be written to the
RLOG (step 1310). Next, the transaction is entered into
the Active Transactions table 1200 in the "active" state
(step 1320). Then the ULOG for the transaction and its
identity are recorded in ULOGloc 1230 (step 1330).
Finally, the HIGH 1240 and NEXT 1250 values are set to
zero (step 1340).
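
The four steps of the Transaction Start procedure can be sketched as follows. This is an illustrative model only: the `TxEntry` and `Node` classes, the in-memory list standing in for the RLOG, and the string used as a ULOG location are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TxEntry:
    """One row of the Active Transactions table (hypothetical layout)."""
    tid: int
    state: str = "active"
    ulog_loc: str = ""
    high: int = 0
    next: int = 0


@dataclass
class Node:
    rlog: list = field(default_factory=list)    # in-memory stand-in for the RLOG
    active: dict = field(default_factory=dict)  # Active Transactions table

    def start_transaction(self, tid, ulog_loc):
        self.rlog.append(("START_TRANSACTION", tid))  # step 1310
        entry = TxEntry(tid=tid, state="active")       # step 1320
        entry.ulog_loc = ulog_loc                      # step 1330
        entry.high = entry.next = 0                    # step 1340
        self.active[tid] = entry
        return entry
```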

Figure 14 shows a procedure 1400 for a Block Update
operation. First, the required concurrency control is
performed to lock the block for update (step
1410). The block is then accessed from persistent
storage if it is not already in cache (step 1420). The indicated
transaction is then performed upon the version of the
block in cache (step 1430). Next, the block's DSI is
updated with the ASI for the action (step 1440). Then, both
RLOG and ULOG records are constructed for the
update and are posted to their appropriate buffers (step
1450). The LastLSNs 950 (Figure 9) are updated
appropriately (step 1460). Then the NEXT 1250 value is set
to the ULOG LSN of the undo record for this action (step
1470).

If the block was clean (step 1475), it is made dirty
(step 1480). It is then put into the Dirty Blocks table 900
(Figure 9), with the recovery LSN 920 set to the LSN of
the RLOG record for it (step 1485).
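
A minimal sketch of the Block Update procedure follows. The dictionary layouts, the use of list positions as LSNs, and the rule that the ASI is the BSI plus one are assumptions made for illustration. Locking (step 1410) is omitted, and for convenience the Dirty Blocks entry is created before the LastLSNs are updated rather than after.

```python
def block_update(cache, disk, dirty, tx, block_id, new_value, rlog, ulog):
    """Apply one update to a cached block and log it (single-threaded sketch)."""
    if block_id not in cache:                       # step 1420: fetch on cache miss
        cache[block_id] = dict(disk[block_id])
    block = cache[block_id]
    old_value, bsi = block["value"], block["dsi"]
    block["value"] = new_value                      # step 1430: perform the update
    block["dsi"] = bsi + 1                          # step 1440: ASI assumed = BSI + 1
    rlog.append({"tid": tx["tid"], "block": block_id,
                 "bsi": bsi, "redo": new_value})    # step 1450: redo record
    rlsn = len(rlog)                                # LSN = 1-based log position
    ulog.append({"rlsn": rlsn, "block": block_id,
                 "undo": old_value})                # step 1450: undo record
    ulsn = len(ulog)
    entry = dirty.get(block_id)
    if entry is None:                               # steps 1475-1485: clean -> dirty
        entry = dirty[block_id] = {"recovery_lsn": rlsn, "last_ulsn": {}}
    entry["last_rlsn"] = rlsn                       # step 1460: LastRLSN
    entry["last_ulsn"][tx["tid"]] = ulsn            # step 1460: LastULSN per TID
    tx["next"] = ulsn                               # step 1470
    return rlsn
```

Note that the recovery LSN is set only when the block first becomes dirty, so later updates to the same block leave it unchanged.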

Figure 15 contains a flow diagram 1500 for a Block
Write operation if the block contains uncommitted data.
First the WAL protocol is enforced (step 1510).
Specifically, prior to writing the block to persistent storage, all
undo buffers are written up to the corresponding
LastULSN 958 (Figure 9) for the block, and the RLOG buffer
is written up to the LastRLSN 955 (Figure 9). For each
transaction identified in the LastULSNs for the block,
HIGH is set to the RLOG LSN value in the RLSN
attribute of the undo record identified by the corresponding
LastULSN attribute from the Dirty Blocks table. Each
LastULSN must identify both a transaction via a TID and
a ULOG LSN. For these logs, there are times when no
writing need be done because these records have
already been written.

The block is then removed from the Dirty Blocks
table 900 (step 1520), and the block is written to persistent
storage (step 1530). A block-write record may then be
written to the RLOG to indicate that the block has been
written to persistent storage, but this is optional. This
block-write record need not be forced.

Figure 16 contains a flow diagram 1600 for a
Transaction Abort operation. First, the undo record indicated
by the value in NEXT field 1250 is located (step 1610).
Then the required concurrency control is performed on
the blocks involved exactly as if they were being
processed by normal updates (step 1620).

Next the current undo log record is applied to its
designated block (step 1630), and a CLR for the undo
action is written in the RLOG (step 1640). The value of
NEXT field 1250 is then decremented to index the next
undo log record to be applied (step 1650) as the
"current" undo record.

If any undo log records remain for the transaction 
(step 1660), control is returned to step 1610. Otherwise 
an ABORT record is placed on the RLOG (step 1670). 
The RLOG is then stored to persistent storage up 
through the ABORT record (step 1680). The ULOG is 
then discarded (step 1690). Finally, the transaction is 
removed from the Active Transaction table 1200 (Figure 
12) (step 1695). 
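
The rollback loop of the Transaction Abort procedure can be sketched as follows. The record layouts and the use of a 1-based NEXT index into an in-memory ULOG list are assumptions for illustration, and locking (step 1620) is omitted in this single-threaded sketch.

```python
def abort_transaction(tx, ulog, cache, rlog, active):
    """Roll back tx by walking its ULOG backwards from NEXT."""
    while tx["next"] > 0:                             # step 1660: records remain?
        undo = ulog[tx["next"] - 1]                   # step 1610: locate undo record
        cache[undo["block"]]["value"] = undo["undo"]  # step 1630: apply it
        rlog.append({"type": "CLR", "tid": tx["tid"],
                     "undoes": undo["rlsn"]})         # step 1640: log the CLR
        tx["next"] -= 1                               # decrement NEXT
    rlog.append({"type": "ABORT", "tid": tx["tid"]})  # step 1670
    forced_upto = len(rlog)                           # step 1680: force through ABORT
    ulog.clear()                                      # step 1690: discard the ULOG
    del active[tx["tid"]]                             # step 1695
    return forced_upto
```

Because each CLR is logged before NEXT is decremented, a crash in the middle of the loop leaves enough information on the RLOG to resume the rollback without repeating undo actions.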

Figure 17 shows a flow diagram 1700 for a
Transaction Prepare operation. First, a prepare log record for
the transaction is written to the RLOG (step 1710). Next,
the RLOG is forced up through this prepare log record
(step 1720). Finally, the state of the transaction is
changed to "prepared" in the Active Transactions table
1200 (step 1730).

Figure 18 shows a flow diagram 1800 for a
Transaction Commit operation. First, a commit log record for
the transaction is written to the RLOG (step 1810). Next,
the RLOG is forced up through this record (step 1820).
Then the ULOG is discarded (step 1830). Finally, the
transaction is removed from the Active Transaction table
1200 (step 1880).
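
The Prepare and Commit procedures share the same write-then-force pattern, sketched below. The `RLog` class, its `force` counter, and the dictionary tables are hypothetical stand-ins; in particular, forcing is modeled only as advancing a high-water mark.

```python
class RLog:
    """In-memory stand-in for a redo log with a forced-up-to mark."""

    def __init__(self):
        self.records = []
        self.forced_upto = 0

    def append(self, rec):
        self.records.append(rec)
        return len(self.records)           # LSN = 1-based position

    def force(self, lsn):
        self.forced_upto = max(self.forced_upto, lsn)


def prepare(tid, rlog, active):
    lsn = rlog.append(("PREPARE", tid))    # step 1710
    rlog.force(lsn)                        # step 1720
    active[tid]["state"] = "prepared"      # step 1730


def commit(tid, rlog, active, ulogs):
    lsn = rlog.append(("COMMIT", tid))     # step 1810
    rlog.force(lsn)                        # step 1820
    ulogs.pop(tid, None)                   # step 1830: discard the ULOG
    del active[tid]                        # final step: leave the active table
```

No undo information reaches the RLOG in either path; on commit the ULOG is simply dropped.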

H. System Crash Recovery Processing 

In the preceding discussion, various aspects of 
logs, state identifiers, and recovery have been dis- 
cussed. They can be combined into an effective recov- 
ery scheme in different methods. The preferred method 
is described below. 

1. Analysis phase

An analysis phase is not strictly necessary. Without 
an analysis phase, however, some unnecessary work 
may be done during the other recovery phases. 

The purpose of the analysis phase is to bring the
system state as stored in the last checkpoint up to the
state of the data base at the time the system crashed.
To do this, the information in the last complete
checkpoint on the RLOG is read and used to initialize the
values for the Dirty Blocks table 900 (Figure 9) and Active
Transactions table 1200 (Figure 12). RLOG records
following this last checkpoint are then read. The analysis
phase simulates the logged actions in their effects on
the two tables.

With regard to the specific records, Start
Transaction records are treated exactly like a start transaction
operation with respect to the Active Transactions table.
Update Log records are treated exactly like a block
update with respect to the Dirty Blocks table 900 and
Active Transactions table 1200, but the update is not
applied. Compensation Log records are treated exactly
like a block update with respect to the Dirty Blocks table
900 and Active Transactions table 1200, except the
value of the NEXT attribute 1250 is decremented, and the
update is not applied.

For block-write records the block is removed from
the Dirty Blocks table 900. For Abort Transaction
records, the transaction is deleted from the Active
Transactions table 1200. For Prepare Transaction
records, the state of the transaction in the Active
Transactions table 1200 is set to "prepared." For Commit
Transaction records, the transaction is deleted from the
Active Transactions table 1200.
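
The per-record rules of the analysis phase can be sketched as a single pass over the RLOG tail. The record type names, the dictionary layouts, and the `next_undo` field (standing for the predecessor undo record that a CLR names as the next undo action) are assumptions for illustration. In this sketch NEXT is tracked as an RLOG LSN, and HIGH is left at its checkpoint or initial value.

```python
def analyze(checkpoint, rlog_tail):
    """Rebuild both tables from a checkpoint plus the (lsn, record) pairs after it."""
    dirty = dict(checkpoint["dirty"])           # block -> recovery LSN
    active = {tid: dict(e) for tid, e in checkpoint["active"].items()}
    for lsn, rec in rlog_tail:
        kind, tid = rec["type"], rec.get("tid")
        if kind == "START":
            active[tid] = {"state": "active", "high": 0, "next": 0}
        elif kind == "UPDATE":                  # simulated, never applied
            dirty.setdefault(rec["block"], lsn)
            active[tid]["next"] = lsn
        elif kind == "CLR":                     # like an update, but NEXT moves back
            dirty.setdefault(rec["block"], lsn)
            active[tid]["next"] = rec["next_undo"]
        elif kind == "BLOCK_WRITE":
            dirty.pop(rec["block"], None)
        elif kind == "PREPARE":
            active[tid]["state"] = "prepared"
        elif kind in ("ABORT", "COMMIT"):
            active.pop(tid, None)
    return dirty, active
```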

To restore the HIGH attribute 1240 for the
transactions in Active Transactions table 1200, the ULOG must
be accessed to find the RLSN attribute of the last record
written to the ULOG. This LSN becomes the value for
HIGH attribute 1240. Alternatively, the value for HIGH
attribute 1240 from the checkpoint can be used or
updated. This RLOG LSN can be used to avoid generating
undo records for actions that are already recorded on
the ULOG for a transaction. Only RLOG records for a
transaction that follow this value need to have undo
information generated.

NEXT attribute 1250 is then either (1) the RLOG
LSN of the last action whose log record is written to the
RLOG for the transaction if that log record is for an
update, or (2) the RLSN attribute of the last CLR written
for the transaction. Thus the NEXT attribute 1250 can
be restored during the analysis pass of the RLOG.
NEXT attribute 1250 identifies, via the RLSN value in
the ULOG records, the next undo record to be
performed. It can also be used to avoid generating undo
records for actions that have already been
compensated by having CLRs written to undo them. Thus, redo
records for a transaction with RLOG LSNs greater than
the NEXT attribute 1250 for the transaction in the Active
Transactions table 1200 do not need to have undo
information generated for them as undo will be done when
the existing CLRs are applied during the redo phase of
recovery.



2. The redo phase 

In the redo phase, all blocks indicated as dirty in the 
reconstructed Dirty Blocks table 900 are read into the 
cache. This read can be done in bulk, overlapped with 
the scanning of the RLOG. 

Some blocks may be read by several nodes to de- 
termine whether they need to be involved in local redo, 
but only one of the nodes will actually perform redo for 
a block. This can, however, be almost completely avoid- 
ed by writing block-write records to the RLOG. Because 
block-write records need not be forced, a block will oc- 
casionally be read from persistent storage when this is 
not necessary. The penalty for such a read is small, how- 
ever. 

A one-log version of every block exists in persistent 
storage, so only one node can have records in its log 
that have a BSI equal to the DSI of the block. This node 
is the one that will independently perform redo process- 
ing on the block. Hence, redo can be done in parallel by 
the separate nodes of the system, each with its own 
RLOG. No concurrency control is needed here. 

The redo phase reconstructs the state of the node's
cache by accessing the dirty blocks needing redo and
posting the changes as indicated in the RLOG records.
The resulting cache contains the dirty blocks in their
states as of the time of the crash. The resulting Dirty Blocks
table 900 and Active Transactions table 1200 are
similarly reconstructed. Blocks that were subject to redo
have been locked.

Only redo records for dirty blocks as indicated in the 
Dirty Blocks table 900 after the analysis phase may 
need to be redone. The redo scan of the RLOG starts 
at the earliest recovery LSN 920 recorded in the Dirty 
Blocks table 900. This is the safe point for redo. Hence, 
all updates to every block since it was written to persist- 
ent storage are assured of being included in the redo 
scan. 

As explained above, there are only two cases that
can arise when trying to apply an RLOG record to its
corresponding block. If the RLOG record's BSI is not
equal to the block's DSI, the logged action can be
ignored. If instead the RLOG record's BSI is equal to the
block's DSI, the appropriate redo activity is performed.
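
The redo scan and its two-case test can be sketched as follows. The record fields (`bsi`, `asi`, `redo`), the dictionary layouts, and the function name are assumptions for illustration; locking and undo-record regeneration are left out.

```python
def redo_scan(rlog, dirty, cache):
    """Replay RLOG records against cached dirty blocks; return the LSNs applied.

    rlog is a list of (lsn, record) pairs; dirty maps block -> recovery LSN.
    """
    if not dirty:
        return []
    start = min(dirty.values())             # earliest recovery LSN: the safe point
    applied = []
    for lsn, rec in rlog:
        if lsn < start or rec["block"] not in dirty:
            continue                         # outside the scan or block is clean
        block = cache[rec["block"]]
        if rec["bsi"] != block["dsi"]:
            continue                         # case 1: states differ, ignore record
        block["value"] = rec["redo"]         # case 2: states match, redo the action
        block["dsi"] = rec["asi"]            # the block now carries the action's ASI
        applied.append(lsn)
    return applied
```

Once one record has been applied to a block, the block's DSI matches the BSI of the next logged action on it, so the remaining history for that block replays in order.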

The redo phase method involves repeating history. 
All update RLOG records, starting with the RLOG record 
denoted by a block's recovery LSN, are applied, even 
those that belong to transactions that will need to be un- 
done subsequently. The principle here is that for an ac- 
tion to be redone, it needs to be applied to the block in 
exactly the state to which the original action was applied. 

In the application of an RLOG record to a block, the
block's DSI is updated to the ASI for the redone action.
The node requests an appropriate lock on the block
when an RLOG action is applied. Redo need not wait
for the lock to be granted, since no other node will
request a lock. The requested locks must, however, be
granted prior to the start of undo. This is the way
concurrency control is initialized for the undo phase.

If a normal update logged on the RLOG needs redo,
a ULOG record may need to be generated for it. All
RLOG redo records for a transaction with LSNs between
the HIGH and NEXT values will have undo information
generated for them. That information preferably
includes ULOG records with RLSN attributes that identify
these records.

If an action is not required to be redone, earlier undo 
records that may have been generated inappropriately 
are discarded because the ULOG has been written, via 
the WAL protocol, to persistent storage up to the ULOG 
record for this action. The HIGH attribute 1240 can be 
updated at this time with the RLOG LSN of this record 
which will, should a checkpoint be taken, reduce the re- 
dundant undo record generation during subsequent re- 
covery should the current recovery process fail. 

For each transaction, generated undo records are 
stored in the transaction's ULOG buffer. These undo 
records, plus those on its ULOG and its CLRs, ensure 
that an active transaction can be rolled back. Hence, at 
the end of the redo phase, all necessary undo log 
records will exist. 

3. The undo phase 

Undo recovery is N-log. Hence, the undo recovery 
phase needs concurrency control in the same way that 
it is needed during transaction rollback. Multiple nodes 
may need to undo changes to the same block. Normal 
data base activity can resume once the undo phase be- 
gins, however, just as normal activity can proceed con- 
currently with transaction abort. All the appropriate lock- 
ing is in place to permit this. This is ensured by not start- 
ing the undo phase until all nodes have completed the 
redo phase. Hence, all locks requested by any node dur- 
ing redo are held by the appropriate node prior to undo 
beginning. 

First all active transactions (but not prepared trans- 
actions) in Active Transactions table 1200 are rolled 
back. Undo processing proceeds exactly as in rolling 
back explicitly aborted transactions, with one exception. 
Some undo records might be present both in an undo 
buffer, where they were regenerated during redo, and in 
a ULOG in persistent storage. These duplicate undo 
records can be detected and ignored. This can be en- 
capsulated in a routine to get the next undo record, so 
that the remainder of the code to undo transactions ac- 
tive at time of crash can be virtually identical to the code 
needed to undo a transaction when the system is oper- 
ating normally. Redundant ULOG records among these 
sources can be eliminated because all undo records are 
identified by the LSN of the RLOG record to which they 
apply. 
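
The routine that fetches the next undo record while hiding duplicates might be sketched as below. The function name and the `rlsn` key are assumptions; the sketch relies only on the property stated above, that every undo record is identified by the LSN of the RLOG record to which it applies.

```python
def undo_records(stable_ulog, regenerated):
    """Yield undo records newest-first, one per RLOG LSN (the 'rlsn' key)."""
    merged = {}
    for rec in list(stable_ulog) + list(regenerated):
        # a regenerated copy has the same RLSN as the stable one; keep the first
        merged.setdefault(rec["rlsn"], rec)
    for rlsn in sorted(merged, reverse=True):
        yield merged[rlsn]
```

With this wrapper in place, the rollback code for crash recovery can consume undo records exactly as it does during a normal transaction abort.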



V. CONCLUSION 

The use of separate RLOGs and ULOGs permits
the optimization of logging operation by making sure
that the undo information is only stored to a ULOG when
absolutely necessary. The test for when such a
necessity arises is whether all the information needed for
changes involved in uncommitted transactions has
been stored or can be recreated.

Further optimization can be obtained by keeping
counts of the changes made during recovery.

It will be apparent to persons of ordinary skill in the
art that modifications and variations can be made
without departing from the scope of this invention as defined
in the appended claims. For example, the architecture
shown in Figure 1 may be different, and the number of
undo and redo logs assigned to each node can vary.



Claims

1. A data processing recovery apparatus comprising:

a redo buffer containing a set of redo records,
said redo buffer including information for
committed and uncommitted transactions;
an undo buffer containing a set of undo records,
said undo buffer including information only for
an uncommitted transaction, said undo records
being aggregated in said undo buffer
separately from said redo records in said redo buffer;
and

a log management routine for starting an
uncommitted transaction, recording redo records
corresponding to said uncommitted transaction
in said redo buffer, recording undo records for
said uncommitted transaction in said undo
buffer, committing said transaction, storing said
redo records corresponding to said committed
transaction from said redo buffer to persistent
storage, and for separately discarding said
undo records corresponding to said committed
transaction from said undo buffer while
retaining said redo records in said redo buffer.

2. The data processing recovery apparatus as claimed
in claim 1 further including an active transactions
table stored in a memory and containing entries
corresponding to transactions which have not been
committed.

3. The data processing recovery apparatus as claimed
in claim 2 further including a means for removing
from said active transactions table an entry
corresponding to a first transaction after said first
transaction is committed.

4. The data processing recovery apparatus as claimed
in claim 2 further including a means for storing the
contents of said undo buffer in said persistent
storage prior to storing changes from corresponding
uncommitted transactions.

5. A method for data processing recovery comprising 
the steps of: 

providing a redo buffer containing a set of redo 
records, said redo buffer including information 
for committed and uncommitted transactions;
providing an undo buffer containing a set of un- 
do records, said undo buffer including informa- 
tion only for an uncommitted transaction, said 
undo records being aggregated in said undo 
buffer separately from said redo records in said 
redo buffer; 

starting an uncommitted transaction; 
recording redo records corresponding to said 
uncommitted transaction in said redo buffer; 
recording undo records for said uncommitted 
transaction in said undo buffer; 
committing said transaction; 
storing said redo records corresponding to said 
committed transaction from said redo buffer to 
persistent storage; and 

separately discarding said undo records corre- 
sponding to said committed transaction from 
said undo buffer while retaining said redo 
records in said redo buffer. 

6. The method as claimed in claim 5 further including 
the step of providing an active transactions table 
stored in a memory and containing entries corre- 
sponding to transactions which have not been com- 
mitted. 

7. The method as claimed in claim 6 further including 
the step of removing from said active transactions 
table an entry corresponding to a first transaction 
after said first transaction is committed. 




The present invention generally relates to the logging of transaction records in a
computer system. More particularly, the present invention relates to the logging of
transaction records in large-scale transaction-based computer application programs.

2. Description of the Related Art 

In a transaction-based computer program, it is often advantageous to record the
steps in a transaction in records, and to write the records to a file on non-volatile storage.
The process of generating the records and writing them to a file may be commonly called
logging, event logging, or transaction logging. The file on non-volatile storage may be
commonly called a log file. The records are commonly written to the log file as soon as
they are created. As used herein, a "transaction" is a series of instructions executed by a
computer system for carrying out a financial operation. A transaction may include
multiple steps, and each step may produce one or more records written to the log file.
Examples of transactions include, but are not limited to, financial transactions such as
deposits, withdrawals, and funds transfers between accounts. Examples of the contents
of records in a transaction include, but are not limited to, account numbers, deposit
amounts, account balances, interest rates and calculations. In addition, each record may
include the time the transaction began, and the time the record was generated. Other
fields may be included in a record. The fields may include, but are not limited to, a field
indicating which program or program module generated the record, and a field indicating
which business unit initiated the transaction. As used herein, a "log file" may include
information relating to transactions which is stored in memory and which may be
structured into fields, records, and/or other suitable data structures.

It may be necessary to periodically unload transaction log records from a log file.
One reason for the periodic unloading is that transaction log files may grow rapidly,
especially in a large-scale transaction-based application where applications on several
servers may be writing records to the log files, so it may be necessary to reduce the size
of the log files. Another reason for the periodic unloading is that the transaction log
records may require periodic processing for business purposes, such as for account
balancing in a financial application. The log file unloading may occur at periodic
intervals ranging from every few minutes to every few days. As used herein, a "logger
unload program" is a computer program that unloads information relating to transactions
from a log file.

A demand for processing performance and scalability greater than that provided
by single- and multi-processor systems led to the development of clusters. In general, a
cluster is a group of servers that may share resources and cooperate in processing. One
type of cluster is the single-system image cluster. The servers in a single-system image
cluster appear as one logical system to clients and to application programs running on the
cluster, hence the name "single-system." Single-system image clusters typically share
external, non-volatile data storage, such as disk drives. Databases and other types of data
permanently reside on the external storage. The servers, however, do not generally share
volatile memory. Each server in the cluster operates in a dedicated local memory space.
Copies of a program may run concurrently on several servers in the cluster. The
workload may be dynamically distributed among the servers. The copies of the programs
may appear as one logical program to the client. All servers in the cluster have access to
all of the data stored in external storage, and a program running on any server in the
cluster may run any transaction.

The single-system image cluster solves the availability and scalability problems
and adds a level of stability by the use of redundant systems with no single points of
failure. Effectively, the one logical system may be available year-round to clients and
application programs without any outages. Hardware and software maintenance and
upgrades may be performed without the loss of availability of the cluster and with little or
no impact to active programs. The combination of availability, scalability, processing
capability, and the logical system image make the single-system image cluster a powerful
environment on which to base a large-scale transaction-based enterprise server.

A single-system image cluster may include at least one Coupling Facility (CF) 
which provides hardware and software support for the cluster's data sharing functions. 
The single-system image cluster may also provide a timer facility to maintain time 

10 synchronization among the servers. On such a system, several operating system images 
such as MVS images may be running on at least one computer system. MVS and OS/390 
are examples of mainframe operating systems. OS/390 is a newer version of the MVS 
operating system, and the terms OS/390 and MVS are used interchangeably herein.
"MVS image" is used synonymously with "server" herein. Operating systems other than
MVS may also run as servers on a single-system image cluster. Each server is allocated
its own local memory space. The servers appear as one logical server to a client. 
Programs may be duplicated in the memory space of several servers. The workload of a 
program may be divided among several copies of the program running on different 
servers. As is the case with the multiple servers appearing as one logical server, multiple
copies of a program running on a single-system image cluster may appear as one logical
program to the client. Mainframe operating systems often include a logging utility to 
provide a common, centralized logging function to programs running on a computer 
system. 

A common event in a transaction-based computer program is the aborting of a
transaction. As used herein, "aborting" includes terminating a transaction before all of
the steps of the transaction have been executed. If transactions are being logged, several 
log records for the transaction may have been stored in a log file at the time of the abort. 
Leaving the log records for an aborted transaction in a log file may cause problems in the
processing of the transactions after they are unloaded from the log file. For example, in a
banking application, a bank may attempt to transfer funds from one account to another
through an intermediate account. The transaction may first withdraw the funds from a
first account, generating a log record; put the funds in the second account, generating a log
record; and then attempt to transfer the funds to the third account. Finding the third
account closed, the desired action may be to abort the entire transaction and leave the 
funds in the first account. The existence of a withdrawal log record and a deposit log 
record for an aborted transaction in the log file may be problematic for programs 
processing transaction log records from a log file. 


It is therefore desirable to provide a method for indicating a completion status for 
a transaction in transaction log records written to a log file for a large-scale transaction- 
based application. It is also desirable to provide a method for distinguishing between 
aborted and successfully completed transactions as the log records are processed. 


Logging utilities provided by operating systems, such as MVS Logger and 
OS/390 Logger, typically do not provide a method for indicating a completion status for
transactions, and thus do not fully support transaction logging as described herein. The
system-provided loggers do provide a common, centralized logging function with many
useful features. It is therefore desirable that a method for indicating a completion status
for a transaction in transaction log records written to a log file be applicable to logging 
utilities provided by operating systems, such as MVS Logger and OS/390 Logger. 

The problem of aborted transactions may also occur in computer systems in 
general where a program or programs log transactions or events to log files. Therefore, a
solution to the aborted transaction problem should preferably be applicable to computer 
programs in general as well as specifically to large-scale transaction-based applications. 

SUMMARY OF THE INVENTION



The present invention provides various embodiments of an improved method and 
system for logging transaction records in a computer system. In one embodiment, the 
method may include writing a confirmation log record to the log file for a transaction that
completes normally, and not writing a confirmation log record for transactions that are 
aborted. The log file may be unloaded periodically by an unload program. The unload 
program may write transaction log records accompanied by a confirmation log record to a 
good output file and transaction log records not accompanied by a confirmation log
record to a suspended output file. On a subsequent execution, the unload program may
combine the log records in the log file and the suspended file. The unload program may 
write transaction log records accompanied by a confirmation log record to a good output 
file. The unload program may write transaction log records not accompanied by a 
confirmation log record and which have not exceeded a transaction time limit to a
suspended output file. The unload program may write transaction log records not
accompanied by a confirmation log record and which have exceeded a transaction time
limit to a disposal output file. The transaction log records in the good output file may 
then be processed normally by log processing programs. 
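
The routing rule summarized above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the names Destination and choose_destination are hypothetical and do not appear in the specification, and the elapsed time and time limit are assumed to be expressed in seconds.

```python
from enum import Enum


class Destination(Enum):
    """Hypothetical labels for the three output files named in the summary."""
    GOOD = "good output file"
    SUSPENDED = "suspended output file"
    DISPOSAL = "disposal output file"


def choose_destination(has_confirmation: bool,
                       elapsed_seconds: float,
                       time_limit_seconds: float) -> Destination:
    """Pick the output file for all log records of one transaction."""
    if has_confirmation:
        # A confirmation log record is positive evidence of completion.
        return Destination.GOOD
    if elapsed_seconds > time_limit_seconds:
        # No confirmation and past the limit: presumed aborted.
        return Destination.DISPOSAL
    # No confirmation yet, but the transaction may still be running.
    return Destination.SUSPENDED
```

Note that a confirmed transaction is routed to the good output file regardless of its elapsed time; the time limit matters only for transactions lacking a confirmation log record.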

In one embodiment, a transaction-based application program (hereinafter referred
to as "the program") running on a server may start a first transaction, and the first
transaction may create a first transaction log record. The first transaction may also create 
a second transaction log record. The generated first and second transaction log records 
may be written to a log file immediately. In one embodiment, more than one program
may be running on a server, the programs may be running transactions, and the
transactions may be writing log records to a log file. 

At some point, the program completes the first transaction. The program may 
generate a transaction confirmation log record for the first transaction and write the 
confirmation log record to the log file. The program may then start a second transaction.
The second transaction may generate a first transaction log record and a second 
transaction log record, and the transaction log records may be written to the log file. The 
program may also start a third transaction, and the third transaction may generate a first 
transaction log record and the transaction log record may be written to the log file. 

Periodically, an unload program unloads the log file. The unload program may 
collect all of a transaction's log records from the log file and examine the records to see if 
a transaction confirmation log record exists for the transaction. The unload program may 
examine the first transaction log records, find the log records for the first transaction, and 
also find the transaction confirmation log record generated for the first transaction. The 
unload program may write the first transaction log records to an output file for completed 
transaction log records. The unload program may also examine the second transaction 
log records, finding the log records generated so far for the second transaction, but not 
finding a transaction confirmation log record for the second transaction. The unload 
program may write the second transaction log records to a suspended file for 
uncompleted transaction log records. The unload program may also examine the third 
transaction log records, finding the log record generated so far for the third transaction, 
but not finding a transaction confirmation log record for the third transaction. The 
unload program may write the third transaction log records to a suspended file for 
uncompleted transaction log records. At this point, the unload of the log file has 
completed. 

At some point, the program completes the second transaction. The program may 
generate a transaction confirmation log record for the second transaction and write it to 
the log file. The third transaction may generate a second transaction log record and write 
it to the log file. 

At some point, the unload program begins the periodic unloading of the entries 
made in the log file since the last unload, and the entries made in the suspended file
during the last unload. The unload program may collect all of a transaction's log records 
from the log file and the suspended file and examine the records to see if a transaction 
confirmation log record exists for the transaction. The unload program may examine the 
second transaction log records, finding the log records generated for the second 
transaction, and also finding the transaction confirmation log record generated for the 
second transaction. The unload program may write the second transaction log records to
the output file for completed transaction log records. The unload program may also 
examine the third transaction log records, find the log records generated so far for the 
third transaction, but not find a transaction confirmation log record for the third 
transaction. 

The unload program may further examine transaction log records that do not 
include a transaction confirmation log record. The unload program may examine the time 
stamp for the transaction log records. The time stamp may be the start time of the 
transaction that created the transaction log records. The unload program may calculate 
the elapsed time of a transaction by subtracting the start time of the transaction from the 
current system time read from a system clock. The calculated elapsed time of the 
transaction may be compared to a transaction time limit. 
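
The elapsed-time check described above amounts to a subtraction followed by a comparison. The following Python sketch shows one possible form; the function name has_exceeded_limit and the use of datetime values are assumptions made for illustration, since the specification does not prescribe a time representation.

```python
from datetime import datetime, timedelta


def has_exceeded_limit(start_time: datetime,
                       current_time: datetime,
                       time_limit: timedelta) -> bool:
    """Return True if the transaction has run longer than the time limit.

    start_time is the time stamp carried by the transaction log records;
    current_time is the value read from the system clock.
    """
    elapsed = current_time - start_time  # elapsed time of the transaction
    return elapsed > time_limit
```

For example, with a hypothetical limit of thirty minutes, a transaction started at 09:00 and examined at 09:45 would be treated as having exceeded the limit, while the same transaction examined at 09:10 would not.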

The unload program may examine the third transaction log records, calculate the 
elapsed time of the third transaction, and compare the elapsed time to a transaction time 
limit. The unload program may assume that transaction log records that do not have an 
accompanying confirmation log record, and for which the transaction elapsed time has 
exceeded the transaction time limit, are transaction log records for a transaction that has 
been aborted. Aborted transaction log records are written to a transaction log record 
disposal file. Finding that the third transaction has not exceeded the transaction time 
limit, the unload program may write the third transaction log records to a suspended file 
for uncompleted transaction log records. At this point the unload of the log file and 
suspended file has completed.



At some point, the program aborts the third transaction. Significantly, no 
transaction confirmation log record is written for the third transaction. 


At some point, the unload program begins the periodic unloading of the entries 
made in the log file since the last unload, and the entries made in the suspended file 
during previous unloads. The unload program may collect all of a transaction's log
records from the log file and the suspended file and examine the records to see if a
transaction confirmation log record exists for the transaction. The unload program may
examine a transaction's log records and find a transaction confirmation log record. The
unload program may write the transaction log records to the output file for completed
transaction log records. The unload program may also examine the third transaction log
records, find the log records generated for the third transaction that were in the suspended
file, but not find a transaction confirmation log record for the third transaction.

The unload program may further examine transaction log records that do not 
include a transaction confirmation log record. The unload program may examine the third
transaction log records, calculate the elapsed time of the third transaction, and compare
the elapsed time to a transaction time limit. The unload program may assume that
transaction log records that do not have an accompanying confirmation log record, and
for which the transaction elapsed time has exceeded the transaction time limit, are
transaction log records for a transaction that has been aborted. Finding that the third
transaction has exceeded the transaction time limit, the unload program may write the
third transaction log records to a disposal file for aborted transaction log records. The
unload program may write the transaction log records of transactions that have not 
exceeded the transaction time limit to a suspended file for uncompleted transaction log 
records. At this point the unload of the log file and suspended file has completed. 



Atty. Dkt. No.: 5053-23900 






Conley, Rose & Tayon, P.C. 



One advantage of the method described herein, including writing a confirmation 
log record for a successfully completed transaction and not writing a confirmation log 
record for an aborted transaction, is that the confirmation log record provides positive 
evidence that a transaction has successfully completed during processing of a log file. 
Another advantage is that the method may be used with logger utilities provided with 
operating systems, such as MVS Logger and OS/390 Logger. Yet another advantage is 
that the method may be applied in computer systems in general where programs perform 
event logging. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 illustrates a server in which programs send log records to a logger 
module; 

Figure 2 illustrates a server with programs sending log records with end records to 
a logger module according to one embodiment; 

Figure 3 illustrates a process of an unload module moving records from a log file 
to an output file; 

Figure 4 illustrates an unload module reading a log file and moving records with 
end records to an output file and records without end records to a suspend file according 
to one embodiment; 

Figure 5 illustrates an unload module reading a log file and a suspend file and 
moving records with end records to an output file, records without end records to a 
suspend file, and timed-out records to a dispose file according to one embodiment; 

Figure 6 is a high-level block diagram of a single-system image cluster system; 

Figure 7a is a flowchart illustrating a process of sorting transaction logs into 
different output categories according to one embodiment; 

Figure 7b is a continuation of flowchart 7a; 

Figure 7c is a continuation of flowchart 7b; 

Figure 7d is a continuation of flowchart 7c. 

While the invention is susceptible to various modifications and alternative forms,
specific embodiments thereof are shown by way of example in the drawings and will
herein be described in detail. It should be understood, however, that the drawings and
detailed description thereto are not intended to limit the invention to the particular form
disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and 
alternatives falling within the spirit and scope of the present invention as defined by the 
appended claims. 

DETAILED DESCRIPTION OF THE DRAWINGS

The term "computer system" as used herein generally describes the hardware and 
software components that in combination allow the execution of computer programs. 
The computer programs may be stored in software, hardware, or a combination of
software and hardware. A computer system's hardware generally includes a processor,
memory media, and Input/Output (I/O) devices. As used herein, the term "processor" or
"processing unit" generally describes the logic circuitry that responds to and processes the
basic instructions that operate a computer system. The term "memory medium" includes an
installation medium, e.g., a CD-ROM, or floppy disks; a volatile computer system memory
such as DRAM, SRAM, EDO RAM, Rambus RAM, etc.; or a non-volatile memory such as
optical storage or a magnetic medium, e.g., a hard drive. The memory medium may
comprise other types of memory or combinations thereof. In addition, the memory medium
may be located in a first computer in which the programs are executed, or may be located in
a second computer connected to the first computer over a network. The term "memory" is
used synonymously with "memory medium" herein. A computer system also generally
includes a system clock to provide the current time to programs. 

A computer system's software generally includes at least one operating system, a 
specialized software program that manages and provides services to other software 
programs on the computer system. Software may also include one or more programs to
perform various tasks on the computer system and various forms of data to be used by the
operating system or other programs on the computer system. The data may include but are
not limited to databases, text files, and graphics files. A computer system's software
generally is stored in non-volatile memory or on an installation medium. A program may be
copied into a volatile memory when running on the computer system. Data may be read 
into volatile memory as required by a program. Some operating systems may include a 
logging utility to provide a common, centralized logging function to programs running on 
a computer system. 

A computer system may comprise more than one operating system. When there is 
more than one operating system, resources such as volatile and non-volatile memory, 
installation media, and processor time may be shared among the operating systems, or 
specific resources may be exclusively assigned to an operating system. For example, each 
operating system may be exclusively allocated a region of volatile memory. The region of 
volatile memory may be referred to as a "partition" or "memory space." A combination of 
an operating system and assigned or shared resources on a computer system may be referred 
to as a "server." A computer system thus may include one or more servers. 

Figure 1 - A server in which programs send log records to a logger module 

Figure 1 illustrates a server 10 including a system memory 20 connected to a data 
storage 30 by a data bus 35, a system clock 15, and a processing unit 12 connected to 
system memory 20 and system clock 15. Processing unit 12 may include a single 
processor or several processors performing in parallel. Processing unit 12 may be coupled to
data storage 30 by a data bus. A log file 40 may be stored on the data storage 30 and may
be maintained by logger module 50. A program 60 and a program 70, running on server
10, may run transactions that generate transaction log records 65 and 75. Program 60 and
program 70 may send the log records to logger module 50, which then may write the
transaction log records to log file 40. Programs 60 and 70 do not send a transaction log
record indicating the end of a transaction to logger module 50. The system clock 15 may
be used in generating the current time and/or creating time stamps for log records. As 
used herein, a "time stamp" is a record of the time at which part of a log file was written 
or a record of the time at which a transaction or a step in a transaction was generated. 


Figure 2 - A server with programs sending log records with end records to a logger 
module according to one embodiment 

Figure 2 illustrates a server 10 including a system memory 20 connected to a data 
storage 30 by a data bus 35, a system clock 15, and a processing unit 12 connected to
system memory 20 and system clock 15. Processing unit 12 may include a single
processor or several processors performing in parallel. Processing unit 12 may be coupled to
data storage 30 by a data bus. A log file 40 may be stored on the data storage 30 and may
be maintained by logger module 50. A program 60 running on server 10 may run a
transaction that may generate a set of transaction log records 65. Program 60 may send
the log records to logger module 50, which then may write the transaction log records 65
to log file 40. When the transaction generating transaction log records 65 ends in
program 60, program 60 generates a transaction end record 66 and sends it to logger
module 50. Logger module 50 then may write transaction end record 66 to log file 40.
Similarly, a program 70 running on server 10 may run a transaction that may generate a
set of transaction log records 75. Program 70 may send the log records to logger module
50, which then may write the transaction log records 75 to log file 40. When the
transaction generating transaction log records 75 ends in program 70, program 70
generates a transaction end record 76 and sends it to logger module 50. Logger module
50 then may write transaction end record 76 to log file 40. The system clock 15 may be
used in generating the current time and/or creating time stamps for log records.
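
The writer side of Figure 2 may be sketched as follows. This Python sketch is illustrative only: the class names LogRecord, LoggerModule, and Transaction, the record layout, and the in-memory list standing in for log file 40 are all assumptions, and the sketch does not represent the interface of MVS Logger or any other system logger.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LogRecord:
    txn_id: str        # identifies the transaction that produced the record
    kind: str          # "data" for an ordinary record, "end" for an end record
    payload: str = ""


@dataclass
class LoggerModule:
    """Stands in for logger module 50; the list stands in for log file 40."""
    log_file: List[LogRecord] = field(default_factory=list)

    def write(self, record: LogRecord) -> None:
        self.log_file.append(record)


class Transaction:
    def __init__(self, txn_id: str, logger: LoggerModule) -> None:
        self.txn_id = txn_id
        self.logger = logger

    def log(self, payload: str) -> None:
        """Write an ordinary transaction log record immediately."""
        self.logger.write(LogRecord(self.txn_id, "data", payload))

    def complete(self) -> None:
        """Normal completion: write the transaction end record."""
        self.logger.write(LogRecord(self.txn_id, "end"))

    def abort(self) -> None:
        """Abort: deliberately write nothing, so no end record exists."""
        return None
```

In this sketch the absence of an "end" record is the only trace an abort leaves in the log, which is what allows a later unload pass to tell completed transactions from uncompleted or aborted ones.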

Figure 3 - A process of an unload module moving records from a log file to an output 
file 

Figure 3 illustrates an unload module 110 extracting transaction log records 120 
from a log file 100 and moving them to an output file 125. Note that the unload module
110 has no mechanism for determining whether transaction log records 120 are complete, so all
of transaction log records 120 are moved into output file 125.

Figure 4 - An unload module reading a log file and moving records with end records to
an output file and records without end records to a suspend file according to one
embodiment

Figure 4 illustrates an unload module 160 extracting transaction log records from 
a log file 150 and sorting the transaction log records based upon the presence of a 
transaction end record. Unload module 160 comprises a logger unload program. Unload
module 160 may perform periodic unloading of log files on a computer system. At the
time unload module 160 performs the unload of the records from log file 150, a
transaction generating transaction log records 170 may have completed. A transaction
end record 171 may have been generated in response to the completion of the transaction
and written to the log file. A "transaction end record" is used herein as a synonym for a
"completion record," i.e., a sequence of characters or binary data indicating that a
transaction has completed. Unload module 160 may read one or more transaction records
170 from log file 150. Unload module 160 may then detect the transaction end record
171 and write the transaction records 170 to an output file 175 in response to detecting
the transaction end record 171. In one embodiment, an unload module may dispose of a
transaction end record after the transaction end record is used. In another embodiment,
an unload module may send a transaction end record to an output file with the rest of the
transaction records.

Unload module 160 also may read one or more transaction records 180 from log 
file 150. When all records in the log file 150 have been processed by unload module 160
and no transaction end record is found for transaction records 180, unload module 160
may write transaction records 180 to a suspend file 185. 
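
The sorting performed in Figure 4 may be sketched in Python as a single pass that groups records by transaction and then routes each group. The function name unload and the (transaction identifier, kind, payload) tuple layout are hypothetical. The sketch follows the embodiment in which the transaction end record is disposed of after use rather than copied to the output file.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Record = Tuple[str, str, str]  # (txn_id, kind, payload); kind is "data" or "end"


def unload(log_records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Split the log into (output_file, suspend_file) contents."""
    by_txn: Dict[str, List[Record]] = defaultdict(list)
    ended: Set[str] = set()

    # First read the whole log, since an end record may arrive late.
    for record in log_records:
        txn_id, kind, _payload = record
        if kind == "end":
            ended.add(txn_id)        # the end record itself is not kept
        else:
            by_txn[txn_id].append(record)

    output_file: List[Record] = []
    suspend_file: List[Record] = []
    for txn_id, records in by_txn.items():
        target = output_file if txn_id in ended else suspend_file
        target.extend(records)
    return output_file, suspend_file
```

The routing decision is deferred until every record in the log has been read, mirroring the description above, in which records 180 are suspended only once the whole log file has been processed without finding an end record.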

Figure 5 - An unload module reading a log file and a suspend file and moving records
with end records to an output file, records without end records to a suspend file, and 
timed-out records to a dispose file according to one embodiment 

Figure 5 illustrates an unload module 160 extracting transaction log records from
a log file 200 and a suspend file 201 and sorting the transaction log records based upon
the presence of a transaction end record and a time limit for incomplete transactions. At
the time the unload module 160 performs the unload of the transaction log records from
log file 200 and suspend file 201, a transaction generating transaction log records 210
may have completed. Transaction end record 211 may have been generated and written
to log file 200 upon completion of the transaction. Another transaction may have started
and written transaction log records 220 to log file 200. Transaction log records 212 and
transaction log records 230 may have been written to a suspend file by an earlier unload
of a log file by unload module 160. After the earlier unload, additional transaction log
records 212 may have been written to log file 200, a transaction generating the
transaction log records 212 may have completed, and transaction end record 213 may
have been written to log file 200.

Unload module 160 may read one or more transaction records 210 from log file 
200. Unload module 160 may then detect the transaction end record 211 and write the
transaction records 210 to one of output files 215 in response to detecting the transaction
end record 211. In this case, the output file is a completed transaction file. As used
herein, a "completed transaction file" may include a file in memory which stores
information relating to completed transactions. Output files may also include an
uncompleted transaction file and an aborted transaction file. As used herein, an
"uncompleted transaction file" may include a file in memory which stores information
relating to uncompleted transactions. As used herein, an "aborted transaction file" may
include a file in memory which stores information relating to aborted transactions.
Unload module 160 may also read one or more transaction records 212 from suspend file
201 and one or more transaction records 212 including transaction end record 213 from
log file 200. Unload module 160 may then detect the transaction end record 213 and
write the transaction records 212 to one of output files 215 in response to detecting the
transaction end record 213. 

When all records in the log file 200 have been read by unload module 160, unload 
module 160 may examine transaction records that have no transaction end record 
associated with them. No transaction end record is found for transaction records 220 and
230. Unload module 160 may then examine transaction records 220 and determine that
the transaction has not exceeded a transaction time limit 16. Unload module 160 may
subtract a time stamp in transaction records 220 from the system time read from a system
clock 15 to determine the transaction elapsed time. Unload module 160 may then write
transaction records 220 to a suspend file 225. Unload module 160 may then examine
transaction records 230 and determine that the transaction has exceeded the transaction 
time limit. Unload module 160 may then write transaction records 230 to a dispose file 
235. 
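
The combined unload of Figure 5 may be sketched as follows. This Python sketch is illustrative and rests on several assumptions not found in the specification: records are modeled as dictionaries with hypothetical keys "txn", "kind", and "start"; times are plain numbers of seconds; and the earliest "start" value among a transaction's records is taken as the transaction start time.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Record = Dict[str, object]  # keys: "txn" (str), "kind" ("data"/"end"), "start" (seconds)


def unload_with_suspend(log_records: List[Record],
                        suspend_records: List[Record],
                        now: float,
                        time_limit: float
                        ) -> Tuple[List[Record], List[Record], List[Record]]:
    """Return (output, suspend, dispose) record lists for one unload pass."""
    by_txn: Dict[str, List[Record]] = defaultdict(list)
    ended: Set[str] = set()

    # Records carried over from the earlier unload are merged with new ones.
    for rec in list(suspend_records) + list(log_records):
        if rec["kind"] == "end":
            ended.add(str(rec["txn"]))
        else:
            by_txn[str(rec["txn"])].append(rec)

    output: List[Record] = []
    suspend: List[Record] = []
    dispose: List[Record] = []
    for txn, recs in by_txn.items():
        if txn in ended:
            output.extend(recs)                       # completed transaction
        elif now - min(float(r["start"]) for r in recs) > time_limit:
            dispose.extend(recs)                      # timed out: presumed aborted
        else:
            suspend.extend(recs)                      # still possibly running
    return output, suspend, dispose
```

Using the reference numerals of Figure 5 as transaction identifiers purely for illustration, records 210 and 212 would land in the output list, records 220 in the new suspend list, and records 230 in the dispose list, provided the time stamps are chosen so that only the transaction that produced records 230 has exceeded the limit.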

Figure 6 - A high-level block diagram of a single-system image cluster system

Figure 6 illustrates an embodiment of a single-system image cluster system that is 
suitable for implementing the logging system and method as described herein. The 
system may include multiple systems (two systems, systems 300 and 310, are shown) 
running mainframe operating systems such as OS/390 or MVS operating systems; at least
one coupling facility 330 to assist in multisystem data sharing functions, wherein the
coupling facility 330 is physically connected to systems in the cluster with high-speed
coupling links 335; a timer facility 340 to synchronize time functions among the servers;
and various storage and I/O devices 320, such as DASD (Direct Access Storage Devices),
tape drives, terminals, and printers, connected to the systems by data buses or other
physical communication links.

Shown in system 300 is a server 350 running a mainframe operating system such 
as MVS. A logger 351 is shown running on server 350. The logger 351 accepts records 
to be logged from program 352 and writes them to a log file 321 shown on external 
storage. Also shown is system 310, a system partitioned into more than one logical
system or server (two servers, servers 360 and 370, both running a mainframe operating
system such as MVS, are shown on system 310). Servers 360 and 370 may also have
programs interfacing with a logger. 

The single-system image cluster system provides dynamic workload balancing 
among the servers in the cluster. To a client working at a terminal or to an application 
program running on the cluster, the servers and other hardware and software in a
single-system image cluster system appear as one logical system.

Figures 7a-7d - A flowchart illustrating a process of sorting transaction logs into different 
output categories according to one embodiment 

Figures 7a through 7d present a flowchart illustrating one embodiment of a
method of transaction logging providing the ability to sort transaction logs into different
output categories. The flowchart may describe a logging process similar to that shown in 
Figure 2. At step 400, a log file is opened. Opening a log file may include creating a new 
log file, or it may include opening a previously created log file. In one embodiment, the
log file may be created on an external storage device such as a disk drive. In another
embodiment, the log file may be created in a volatile memory and later written to a
non-volatile storage. In one embodiment, an application program running on a server may
create the log file. In another embodiment, a system logging utility may provide log file
creation and maintenance to application programs running on the server. In yet another
embodiment, a logger interface program may provide log file creation and maintenance
to application programs, and may interface to a system logging utility. As used herein, a 
"logger interface program" includes a program which is configured to accept transactions 
generated by programs and send the transactions to a system logging utility. 

At step 401, a first transaction generates a first transaction log record. At step
402, the first transaction generates a second transaction log record. In this flowchart,
generating a transaction log record may include the creation of the transaction log record 
and writing the transaction log record to a log file. In one embodiment, a program may 
directly write a transaction log record to a log file. In another embodiment, a program 
may send a transaction log record to a logging program and the logging program may 
write the transaction log record to a log file. In yet another embodiment, a program may
send a transaction log record to a system logging utility and the system logging utility
may write the transaction log record to a log file. In one embodiment, a transaction may
include a number of steps, wherein each step in the transaction may generate one or more
transaction log records.

In step 403, the first transaction may complete. The program may generate a 
transaction confirmation log record for the first transaction in step 404. The terms 
transaction confirmation log record, transaction end record, and transaction completion 
record are synonymous as used herein. As used herein, a "transaction completion record"
may include information, such as a sequence of characters or binary data, indicating that
a transaction has completed. A second transaction may generate a first transaction log 
record in step 405, and a second transaction log record in step 406. In step 407, a third 
transaction may generate a first transaction log record. 

In step 408, an unload program as illustrated in Figure 6 begins unloading the log
file. In step 409, the unload program may collect all of a transaction's log records from
the log file and examine the records to see if a transaction confirmation log record exists 
for the transaction. In step 409, the unload program may examine the first transaction log 
records, finding the log records generated in steps 401 and 402, and also finding the 
transaction confirmation log record generated in step 404. The unload program may
write the first transaction log records to an output file for completed transaction log
records in step 410. In one embodiment, all of a completed transaction's log records may
be written to a completed transaction output file. In another embodiment, transaction
confirmation log records are deleted after they are used to identify completed transactions
and are not written to a completed transaction output file. In step 409, the unload
program may also examine the second transaction log records, finding the log records
generated in steps 405 and 406, but not finding a transaction confirmation log record.
The unload program may write the second transaction log records to a suspended file for
uncompleted transaction log records in step 411. In step 409, the unload program may
also examine the third transaction log records, finding the log record generated in step
407, but not finding a transaction confirmation log record. The unload program may
write the third transaction log records to a suspended file for uncompleted transaction log 
records in step 412. In step 413, the unload of the log file is completed. 
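The first unload pass (steps 408 through 413) can be sketched as follows. This is a minimal illustration under stated assumptions, not the unload program itself: records are represented as hypothetical (transaction id, payload, is_completion) tuples, and the output file and suspended file are represented as in-memory lists.

```python
from collections import defaultdict


def unload(records):
    """Split log records into completed and suspended lists.

    Each record is a hypothetical (txn_id, payload, is_completion) tuple.
    A transaction counts as completed when any of its records is a
    completion record; otherwise all of its records are suspended.
    """
    by_txn = defaultdict(list)
    for rec in records:
        by_txn[rec[0]].append(rec)  # collect all of a transaction's records

    completed, suspended = [], []
    for recs in by_txn.values():
        if any(is_completion for _, _, is_completion in recs):
            completed.extend(recs)
        else:
            suspended.extend(recs)
    return completed, suspended


log_file = [
    ("T1", "step 401", False),
    ("T1", "step 402", False),
    ("T1", "step 404", True),   # completion record for the first transaction
    ("T2", "step 405", False),
    ("T2", "step 406", False),
    ("T3", "step 407", False),
]
completed, suspended = unload(log_file)
```

In this sketch the completion record is kept with the completed transaction's records; an embodiment that discards confirmation records would filter them out before writing.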

In step 414, the second transaction may complete. The program may generate a 
transaction confirmation log record for the second transaction in step 415. In step 416, 
the third transaction may generate a second transaction log record. 

In step 417, the unload program begins unloading the entries made in the log file 
since the last unload, and the entries made in the suspended file during the last unload. In 
step 418, the unload program may collect all of a transaction's log records from the log 
file and the suspended file and examine the records to see if a transaction confirmation 
log record exists for the transaction. In step 418, the unload program may examine the
second transaction log records, finding the log records generated in steps 405 and 406, 
and also finding the transaction confirmation log record generated in step 415. The 
unload program may write the second transaction log records to the output file for 
completed transaction log records in step 419. In step 418, the unload program may also 
examine the third transaction log records, find the log records generated in steps 407 and 
416, but not find a transaction confirmation log record. 
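The second unload pass (steps 417 through 419) differs from the first in that it reads the suspended file together with the new log entries. A minimal sketch, assuming hypothetical dictionary records with "txn", "step", and "done" keys and in-memory lists in place of files:

```python
def unload_incremental(new_log_records, suspended_records):
    """Re-evaluate suspended records together with entries logged since the last unload."""
    groups = {}
    # Suspended records come first so each transaction's records stay in order.
    for rec in list(suspended_records) + list(new_log_records):
        groups.setdefault(rec["txn"], []).append(rec)

    completed, still_suspended = [], []
    for recs in groups.values():
        # A completion record anywhere in the group marks the transaction as completed.
        target = completed if any(r["done"] for r in recs) else still_suspended
        target.extend(recs)
    return completed, still_suspended


suspended_file = [
    {"txn": "T2", "step": 405, "done": False},
    {"txn": "T2", "step": 406, "done": False},
    {"txn": "T3", "step": 407, "done": False},
]
new_entries = [
    {"txn": "T2", "step": 415, "done": True},   # completion record, second transaction
    {"txn": "T3", "step": 416, "done": False},  # second record, third transaction
]
completed, suspended_file = unload_incremental(new_entries, suspended_file)
```

The second transaction is now matched with its completion record, while the third transaction's records from both sources remain uncompleted.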

In step 420, the unload program may further examine a transaction's log records 
that do not include a transaction confirmation log record. In one embodiment, a 
transaction log record may include at least one time stamp. A transaction start time is 
one example of a time stamp, wherein a transaction start time may indicate the time at 
which a transaction was generated or written to a log file. In one embodiment of a 
transaction log record including a time stamp, the time stamp may be represented as a 
text representation of a year, month, day of the month, hour, minute, seconds, and 
fractions of seconds. In another embodiment of a transaction log record including a time 
stamp, the time stamp may be represented as a binary number, and the binary number 
may represent a number of fractions of a second since a system-determined base time. 
Other methods of representing a time stamp in a transaction log record will be obvious to 
one skilled in the art. In one embodiment, a transaction that generates transaction log
records may use the start time of the transaction as a time stamp for transaction log
records. In one embodiment, a time stamp used by a transaction may be unique for that
transaction and may be used to uniquely identify the transaction. In step 420, the unload
program may examine the time stamp for the transaction log records. The time stamp
may be the start time of the transaction that generated the transaction log records. The
unload program may calculate the elapsed time of a transaction by subtracting the start
time of the transaction from the current system time read from a system clock (see Figure
5, item 15). The calculated elapsed time of the transaction may be compared to a
transaction time limit (see Figure 5, item 16). In one embodiment, a transaction time
limit for a transaction log record may be read from a program that generated the
transaction log record. In another embodiment, a transaction time limit may be set in the
unload program. In yet another embodiment, a transaction time limit may be entered by a 
user of the unload program before transaction log records in a log file and suspended file 
are processed. 
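The elapsed-time calculation described above may be illustrated as follows. The text layout of the time stamp, the choice of microseconds as the fraction of a second, and all function names are assumptions made for this sketch only; the description does not fix any of them.

```python
from datetime import datetime, timedelta

# Assumed text layout: year, month, day, hour, minute, seconds, fractions of a second.
TEXT_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_text_stamp(text):
    """Decode a time stamp stored as text."""
    return datetime.strptime(text, TEXT_FORMAT)


def parse_binary_stamp(ticks, base_time):
    """Decode a time stamp stored as a count of microseconds since a base time."""
    return base_time + timedelta(microseconds=ticks)


def elapsed_time(start_time, current_time):
    """Elapsed time = current system time minus transaction start time."""
    return current_time - start_time


def exceeds_limit(start_time, current_time, time_limit):
    """True when the transaction has been open longer than the time limit."""
    return elapsed_time(start_time, current_time) > time_limit


start = parse_text_stamp("2001-03-15 10:00:00.250000")
same_start = parse_binary_stamp(250_000, datetime(2001, 3, 15, 10, 0, 0))
now = datetime(2001, 3, 15, 10, 5, 0, 250000)   # stand-in for the system clock
limit = timedelta(minutes=3)                    # stand-in transaction time limit
```

Here the elapsed time is five minutes, which exceeds the three-minute limit used for illustration.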

In step 420, the unload program may examine the third transaction log records,
calculate the elapsed time of the third transaction, and compare the elapsed time to a
transaction time limit. In step 420, the unload program may assume that a transaction
having transaction log records that do not have an accompanying confirmation log
record, and for which the transaction elapsed time has exceeded the transaction time
limit, has been aborted. Aborted transaction log records are written to a transaction log
record disposal file in step 421. In one embodiment, the record disposal file may be kept
after the unload program completes. In another embodiment, the record disposal file is
deleted by the unload program before the unload program completes. Finding that the
third transaction has not exceeded the transaction time limit, the unload program may
write the third transaction log records to a suspended file for uncompleted transaction log
records in step 422. In step 423, the unload program finishes the unload of the log file
and suspended file.

In step 424, the third transaction is aborted. No transaction confirmation log
record is written for the third transaction.

In step 425, the unload program begins unloading the entries made in the log file
since the last unload, and the entries made in the suspended file during previous unloads.
In step 426, the unload program may collect all of a transaction's log records from the log
file and the suspended file and examine the records to see if a transaction confirmation
log record exists for the transaction. In step 427, the unload program may examine a
transaction's log records and find a transaction confirmation log record. The unload
program may write the transaction log records to the output file for completed transaction
log records in step 427. In step 426, the unload program may also examine the third
transaction log records, find the log records generated in steps 407 and 416, but not find a
transaction confirmation log record.

In step 428, the unload program may further examine a transaction's log records
that do not include a transaction confirmation log record. The unload program may
examine the third transaction log records, calculate the elapsed time of the third
transaction, and compare the elapsed time to a transaction time limit. In step 428, the
unload program may assume that a transaction having transaction log records that do not
have an accompanying confirmation log record, and for which the transaction elapsed
time has exceeded the transaction time limit, has been aborted. Finding that the third
transaction has exceeded the transaction time limit, the unload program may write the
third transaction log records to a disposal file for aborted transaction log records in step
429. The unload program may write the transaction log records of transactions that have
not exceeded the transaction time limit to a suspended file for uncompleted transaction
log records in step 430. In step 431, the unload program finishes the unload of the log
file and suspended file.
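The three-way decision made across the unload passes (completed, suspended, or aborted) can be summarized in a short sketch. The dictionary layout, the use of plain seconds for times, and the numeric values are hypothetical and chosen only to mirror the walkthrough of the third transaction.

```python
def classify(transactions, current_time, time_limit):
    """Assign each transaction to the completed, suspended, or disposal output.

    transactions maps a transaction id to a dict with a "start" time in
    seconds and a "completed" flag (True when a completion record was found).
    """
    out = {"completed": [], "suspended": [], "disposal": []}
    for txn_id, info in transactions.items():
        if info["completed"]:
            out["completed"].append(txn_id)
        elif current_time - info["start"] > time_limit:
            out["disposal"].append(txn_id)   # assumed aborted: limit exceeded
        else:
            out["suspended"].append(txn_id)  # may still complete later
    return out


third = {"T3": {"start": 100.0, "completed": False}}
TIME_LIMIT = 60.0  # illustrative transaction time limit, in seconds

second_pass = classify(third, current_time=130.0, time_limit=TIME_LIMIT)  # as in step 420
third_pass = classify(third, current_time=200.0, time_limit=TIME_LIMIT)   # as in step 428
```

In the earlier pass the third transaction is still within the limit and is suspended; in the later pass the limit has been exceeded and its records go to the disposal output.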

Various embodiments further include receiving or storing instructions and/or data
implemented in accordance with the foregoing description upon a carrier medium.
Suitable carrier media include memory media or storage media such as magnetic or
optical media, e.g., disk or CD-ROM, as well as signals such as electrical,
electromagnetic, or digital signals, conveyed via a communication medium such as
networks and/or a wireless link.

Although the system and method of the present invention have been described in 
connection with several embodiments, the invention is not intended to be limited to the 
specific forms set forth herein, but on the contrary, it is intended to cover such 
alternatives, modifications, and equivalents as can be reasonably included within the 
spirit and scope of the invention as defined by the appended claims.

WHAT IS CLAIMED IS:



1. A method comprising:

writing a first transaction to a log file; 
completing the first transaction; 

writing a completion record for the first transaction to the log file, wherein 
the completion record indicates that the first transaction has completed; 

writing a second transaction to the log file; 

reading the completed first transaction from the log file; 

reading the second transaction from the log file; 

writing the completed first transaction to a completed transaction file; and 
writing the second transaction to an uncompleted transaction file. 

2. The method of claim 1, further comprising: 

completing the second transaction; 

writing a completion record for the second transaction to the log file, 
wherein the completion record indicates that the second transaction has 
completed;

reading contents of the log file and contents of the uncompleted
transaction file, wherein the contents of the log file include the completion record 
for the second transaction, and wherein the contents of the uncompleted 
transaction file include the second transaction; 

determining that the second transaction from the uncompleted transaction 
file corresponds to the completion record for the second transaction from the log 
file; and 

writing the completed second transaction to the completed transaction file 
in response to determining that the second transaction from the uncompleted 
transaction file corresponds to the completion record for the second transaction 
from the log file. 

3. The method of claim 2, further comprising: 

providing a transaction time limit for transactions, wherein the transaction 
time limit indicates a time by which the transaction must complete to be valid; 

providing a current time; 

writing a third transaction to the log file, wherein the third transaction 
includes a transaction start time, wherein the transaction start time indicates a 
time at which the third transaction began; 

reading contents of the log file and contents of the uncompleted 
transaction file;

calculating a transaction elapsed time for the third transaction by
subtracting the transaction start time from the current time;

comparing the transaction elapsed time of the third transaction to the 
transaction time limit; and 

writing the third transaction to an aborted transaction file when the 
transaction elapsed time of the third transaction has exceeded the transaction time 
limit. 

4. The method of claim 1, wherein each transaction comprises at least one 
transaction record, and wherein a transaction record comprises at least one field. 

5. The method of claim 4, wherein each transaction record further comprises an 
identifier field, wherein the identifier field is unique to the transaction, such that 
the identifier field uniquely identifies a transaction record as belonging to a 
particular transaction, and such that the identifier field is used to distinguish the
particular transaction from other transactions.

6. The method of claim 5, wherein each transaction record further comprises a time 
field, wherein the time field comprises a time stamp. 

7. The method of claim 6, wherein the identifier field is the time field, wherein the 
time stamp is a time at which the transaction was started. 

8. The method of claim 7, wherein one transaction record is a completion record 
comprising a time field and a transaction complete field, wherein the transaction 
complete field includes information identifying a record as a completion record.

9. The method of claim 1, wherein each transaction includes a program identifier,
wherein the program identifier is a unique piece of data that indicates which 
program of a plurality of programs generated the transaction. 

10. The method of claim 9, wherein a logger interface program is configured to
accept transactions generated by programs and send the transactions to a system
logger.

11. The method of claim 10, wherein the system logger is a component of an
operating system.

12. The method of claim 10, further comprising:

the logger interface program receiving the first transaction from a first
program;

the logger interface program sending the first transaction to the system
logger;

the logger interface program receiving the second transaction from a
second program;

the logger interface program sending the second transaction to the system
logger.

13. The method of claim 12, wherein the logger interface program is executable by a
mainframe computer system.

14. The method of claim 1,

wherein reading transactions from the log file further comprises a logger
unload program reading transactions from the log file; 

wherein writing the completed first transaction to a completed transaction 
file further comprises the logger unload program writing the completed first 
transaction to a completed transaction file; and 

wherein writing the second transaction to an uncompleted transaction file 
further comprises the logger unload program writing the second transaction to an 
uncompleted transaction file. 

15. The method of claim 14, further comprising: 

writing a completion record for the second transaction to the log file, 
wherein the completion record indicates that the second transaction has 
completed; 

the logger unload program reading contents of the log file and contents of 
the uncompleted transaction file; 

the logger unload program determining that the second transaction from 
the uncompleted transaction file corresponds to the completion record for the 
second transaction from the log file; and 

the logger unload program writing the completed second transaction to the 
completed transaction file in response to the logger unload program determining 
that the second transaction from the uncompleted transaction file corresponds to 
the completion record for the second transaction from the log file.

16. The method of claim 15, further comprising:

providing a transaction time limit for transactions, wherein the transaction 
time limit indicates a time by which the transaction must complete to be valid; 

providing a current time; 

writing a third transaction to the log file, wherein the third transaction 
includes a transaction start time, wherein the transaction start time indicates a 
time at which the third transaction began; 

the logger unload program reading contents of the log file and contents of 
the uncompleted transaction file; 

the logger unload program calculating a transaction elapsed time for the
third transaction by subtracting the transaction start time from the current time;

the logger unload program comparing the transaction elapsed time of the 
third transaction to the transaction time limit; and 

the logger unload program writing the third transaction to an aborted 
transaction file when the transaction elapsed time of the third transaction has 
exceeded the transaction time limit. 

17. A method for logging transactions in a computer system, the method comprising: 

starting a first transaction;
writing a first record for the first transaction to a log file;
starting a second transaction; 

writing a first record for the second transaction to the log file; 
starting a third transaction; 

writing a first record for the third transaction to the log file; 
completing the first transaction; 

writing a completion record for the first transaction to the log file, wherein 
the completion record indicates that the first transaction has completed; 

reading the first record for the first transaction, the first record for the 
second transaction, the first record for the third transaction, and the completion 
record for the first transaction from the log file; 

writing the completed first transaction to a completed transaction file; and 

writing the second transaction and the third transaction to an uncompleted 
transaction file. 

18. The method of claim 17, further comprising: 

providing a transaction time limit for transactions, wherein the transaction 
time limit indicates a time by which the transaction must complete to be valid;

providing a current time;
completing the second transaction; 

writing a completion record for the second transaction to the log file, 
wherein the completion record indicates that the second transaction has 
completed; 

aborting the third transaction; 

reading contents of the log file and contents of the uncompleted 
transaction file; 

determining that the second transaction from the uncompleted transaction 
file corresponds to the completion record for the second transaction from the log 
file; 

writing the completed second transaction to the completed transaction file 
in response to determining that the second transaction from the uncompleted 
transaction file corresponds to the completion record for the second transaction 
from the log file; 

calculating a transaction elapsed time for the third transaction by 
subtracting a time stamp of the first record of the third transaction from the 
current time; 

comparing the transaction elapsed time of the third transaction to the 
transaction time limit; and

writing the third transaction to an aborted transaction file when the
transaction elapsed time of the third transaction has exceeded the transaction time 
limit. 



19. A system comprising:

a processing unit; 

a system memory coupled to the processing unit; 

a data storage coupled to the processing unit; 

wherein the system memory stores program instructions, wherein the 
program instructions are executable by the processing unit to: 

write a first transaction to a log file in the data storage; 

write a completion record for the first transaction to the log file, 
wherein the completion record indicates that the first transaction has 
completed; 

write a second transaction to the log file; 

read the completed first transaction from the log file; 

read the second transaction from the log file; 

write the completed first transaction to a completed transaction file 
in the data storage; and

write the second transaction to an uncompleted transaction file in
the data storage. 

20. The system of claim 19, wherein the program instructions are further executable 
to: 

complete the second transaction; 

write a completion record for the second transaction to the log file, 
wherein the completion record indicates that the second transaction has 
completed; 

read contents of the log file and contents of the uncompleted transaction 
file, wherein the contents of the log file include the completion record for the 
second transaction, and wherein the contents of the uncompleted transaction file
include the second transaction; 

determine that the second transaction from the uncompleted transaction 
file corresponds to the completion record for the second transaction from the log 
file; and 

write the completed second transaction to the completed transaction file in 
response to determining that the second transaction from the uncompleted 
transaction file corresponds to the completion record for the second transaction 
from the log file. 

21. The system of claim 20, further comprising:

a system clock coupled to the processing unit;

wherein the program instructions are further executable to: 

provide a transaction time limit for transactions, wherein the
transaction time limit indicates a time by which the transaction must 
complete to be valid; 

determine a current time by reading the system clock; 

write a third transaction to the log file, wherein the third 
transaction includes a transaction start time, wherein the transaction start 
time indicates a time at which the third transaction began; 

read contents of the log file and contents of the uncompleted 
transaction file; 

calculate a transaction elapsed time for the third transaction by
subtracting the transaction start time from the current time;

compare the transaction elapsed time of the third transaction to the 
transaction time limit; and 

write the third transaction to an aborted transaction file when the 
transaction elapsed time of the third transaction has exceeded the 
transaction time limit. 

22. The system of claim 19, wherein each transaction comprises at least one
transaction record, and wherein a transaction record comprises at least one field.

23. The system of claim 22, wherein each transaction record further comprises an
identifier field, and wherein the identifier field is unique to the transaction, such 
that the identifier field uniquely identifies a transaction record as belonging to a 
particular transaction, and such that the identifier field is used to distinguish the 
particular transaction from other transactions. 

24. The system of claim 23, wherein each transaction record further comprises a time 
field, and wherein the time field comprises a time stamp. 

25. The system of claim 24, wherein the identifier field is the time field, and wherein 
the time stamp is a time at which the transaction was started. 

26. The system of claim 25, wherein one transaction record is a completion record 
comprising a time field and a transaction complete field, wherein the transaction 
complete field includes information identifying a record as a completion record. 

27. The system of claim 19, wherein each transaction includes a program identifier, 
wherein the program identifier is a unique piece of data that indicates which 
program of a plurality of programs generated the transaction. 

28. The system of claim 27, wherein a logger interface program is configured to 
accept transactions generated by programs and send the transactions to a system 
logger. 

29. The system of claim 28, wherein the system logger is a component of an 
operating system. 

30. The system of claim 28,

wherein the program instructions further comprise a first application
program and a second application program; 

wherein the logger interface program is executable to receive the first 
transaction from a first program; 

wherein the logger interface program is executable to send the first 
transaction to the system logger; 

wherein the logger interface program is executable to receive the second 
transaction from a second program; 

wherein the logger interface program is executable to send the second 
transaction to the system logger; 

wherein the system logger is executable to write the first transaction to the 
log file and write the second transaction to the log file. 

31. The system of claim 19, wherein the system is a mainframe computer system.

32. A system comprising: 

a processing unit; 

a system memory coupled to the processing unit; 
a data storage coupled to the processing unit;

wherein the system memory stores program instructions, wherein the
program instructions are executable by the processing unit to: 

start a first transaction; 

write a first record for the first transaction to a log file in the data 
storage; 

start a second transaction; 

write a first record for the second transaction to the log file; 
start a third transaction; 

write a first record for the third transaction to the log file; 
complete the first transaction; 

write a completion record for the first transaction to the log file, 
wherein the completion record indicates that the first transaction has 
completed; 

read the first record for the first transaction, the first record for
the second transaction, the first record for the third transaction, and the 
completion record for the first transaction from the log file into the system 
memory; 

write the completed first transaction to a completed transaction file 
in the data storage; and

write the second transaction and the third transaction to an
uncompleted transaction file in the data storage. 

33. The system of claim 32, further comprising: 

a system clock coupled to the processing unit; 

wherein the program instructions are further executable to: 

provide a transaction time limit for transactions, wherein the
transaction time limit indicates a time by which the transaction must 
complete to be valid; 

determine a current time by reading the system clock; 

complete the second transaction; 

write a completion record for the second transaction to the log file, 
wherein the completion record indicates that the second transaction has 
completed; 

abort the third transaction; 

read contents of the log file and contents of the uncompleted 
transaction file into the system memory;

determine that the second transaction from the uncompleted
transaction file corresponds to the completion record for the second 
transaction from the log file; 

write the completed second transaction to the completed 
transaction file in response to determining that the second transaction from 
the uncompleted transaction file corresponds to the completion record for 
the second transaction from the log file; 

calculate a transaction elapsed time for the third transaction by 
subtracting a time stamp of the first record of the third transaction from 
the current time; 

compare the transaction elapsed time of the third transaction to the 
transaction time limit; and 

write the third transaction to an aborted transaction file when the 
transaction elapsed time of the third transaction has exceeded the 
transaction time limit. 

34. A carrier medium comprising program instructions, wherein the program 
instructions are executable by a machine to implement: 

writing a first transaction to a log file; 

completing the first transaction; 

writing a completion record for the first transaction to the log file, wherein 
the completion record indicates that the first transaction has completed;

writing a second transaction to the log file;

reading the completed first transaction from the log file; 

reading the second transaction from the log file; 

writing the completed first transaction to a completed transaction file; and 

writing the second transaction to an uncompleted transaction file. 

35. The carrier medium of claim 34, wherein the program instructions are further 
executable by the machine to implement: 

completing the second transaction; 

writing a completion record for the second transaction to the log file, 
wherein the completion record indicates that the second transaction has 
completed; 

reading contents of the log file and contents of the uncompleted 
transaction file, wherein the contents of the log file include the completion record 
for the second transaction, and wherein the contents of the uncompleted 
transaction file include the second transaction; 

determining that the second transaction from the uncompleted transaction 
file corresponds to the completion record for the second transaction from the log 
file; and

writing the completed second transaction to the completed transaction file
in response to determining that the second transaction from the uncompleted 
transaction file corresponds to the completion record for the second transaction 
from the log file. 

36. The carrier medium of claim 35, wherein the program instructions are further 
executable by the machine to implement: 

providing a transaction time limit for transactions, wherein the transaction 
time limit indicates a time by which the transaction must complete to be valid; 

providing a current time; 

writing a third transaction to the log file, wherein the third transaction 
includes a transaction start time, wherein the transaction start time indicates a 
time at which the third transaction began; 

reading contents of the log file and contents of the uncompleted 
transaction file; 

calculating a transaction elapsed time for the third transaction by
subtracting the transaction start time from the current time;

comparing the transaction elapsed time of the third transaction to the 
transaction time limit; and 

writing the third transaction to an aborted transaction file when the 
transaction elapsed time of the third transaction has exceeded the transaction time 
limit.

37. The carrier medium of claim 34, wherein each transaction comprises at least one
transaction record, and wherein a transaction record comprises at least one field.



38. The carrier medium of claim 37, wherein each transaction record further 
comprises an identifier field, and wherein the identifier field is unique to the 
transaction, such that the identifier field uniquely identifies a transaction record as 
belonging to a particular transaction, and such that the identifier field is used to 
distinguish the particular transaction from other transactions. 

39. The carrier medium of claim 38, wherein each transaction record further 
comprises a time field, wherein the time field comprises a time stamp. 

40. The carrier medium of claim 39, wherein the identifier field is the time field, 
wherein the time stamp is a time at which the transaction was started. 

41. The carrier medium of claim 40, wherein one transaction record is a completion
record comprising a time field and a transaction complete field, wherein the 
transaction complete field includes information identifying a record as a 
completion record. 

42. The carrier medium of claim 34, wherein each transaction includes a program 
identifier, wherein the program identifier is a unique piece of data that indicates 
which program of a plurality of programs generated the transaction. 

43. The carrier medium of claim 42, wherein a logger interface program is configured 
to accept transactions generated by programs and send the transactions to a 
system logger.

44. The carrier medium of claim 43, wherein the system logger is a component of an
operating system. 

45. The carrier medium of claim 43, wherein the program instructions are further 
executable by the machine to implement: 

the logger interface program receiving the first transaction from a first 
program; 

the logger interface program sending the first transaction to the system
logger;

the logger interface program receiving the second transaction from a
second program;

the logger interface program sending the second transaction to the system
logger.

46. The carrier medium of claim 34, wherein the machine is a mainframe computer 
system. 

47. The carrier medium of claim 34, 

wherein reading transactions from the log file comprises a logger unload 
program executable for reading transactions from the log file; 

wherein writing the completed first transaction to a completed transaction 
file comprises the logger unload program executable for writing the completed 
first transaction to a completed transaction file; and 

wherein writing the second transaction to an uncompleted transaction file 
comprises the logger unload program executable for writing the second 
transaction to an uncompleted transaction file. 

48. The carrier medium of claim 47, wherein the program instructions are further 
executable by the machine to implement: 

writing a completion record for the second transaction to the log file, 
wherein the completion record indicates that the second transaction has 
completed; 

the logger unload program reading contents of the log file and contents of 
the uncompleted transaction file; 

the logger unload program determining that the second transaction from 
the uncompleted transaction file corresponds to the completion record for the 
second transaction from the log file; and 

the logger unload program writing the completed second transaction to the 
completed transaction file in response to the logger unload program determining 
that the second transaction from the uncompleted transaction file corresponds to 
the completion record for the second transaction from the log file. 

49. The carrier medium of claim 48, wherein the program instructions are further 
executable by the machine to implement: 

providing a transaction time limit for transactions, wherein the transaction 
time limit indicates a time by which the transaction must complete to be valid; 

providing a current time; 

writing a third transaction to the log file, wherein the third transaction 
includes a transaction start time, wherein the transaction start time indicates a 
time at which the third transaction began; 

the logger unload program reading contents of the log file and contents of 
the uncompleted transaction file; 

the logger unload program calculating a transaction elapsed time for the 
third transaction by subtracting the transaction start time from the current time; 

the logger unload program comparing the transaction elapsed time of the 
third transaction to the transaction time limit; and 

the logger unload program writing the third transaction to an aborted 
transaction file when the transaction elapsed time of the third transaction has 
exceeded the transaction time limit. 

50. The carrier medium of claim 34, wherein the carrier medium is a memory 
medium. 

51. A carrier medium comprising program instructions, wherein the program 
instructions are executable by a machine to implement: 

starting a first transaction; 

writing a first record for the first transaction to a log file; 

starting a second transaction; 

writing a first record for the second transaction to the log file; 

starting a third transaction; 

writing a first record for the third transaction to the log file; 

completing the first transaction; 

writing a completion record for the first transaction to the log file, wherein 
the completion record indicates that the first transaction has completed; 

reading the first record for the first transaction, the first record for the 
second transaction, the first record for the third transaction, and the completion 
record for the first transaction from the log file; 

writing the completed first transaction to a completed transaction file; and 

writing the second transaction and the third transaction to an uncompleted 
transaction file. 

52. The carrier medium of claim 51, wherein the program instructions are further 
executable by the machine to implement: 

providing a transaction time limit for transactions, wherein the transaction 
time limit indicates a time by which the transaction must complete to be valid; 

providing a current time; 

completing the second transaction; 

writing a completion record for the second transaction to the log file, 
wherein the completion record indicates that the second transaction has 
completed; 

aborting the third transaction; 

reading contents of the log file and contents of the uncompleted 
transaction file; 

determining that the second transaction from the uncompleted transaction 
file corresponds to the completion record for the second transaction from the log 
file; 

writing the completed second transaction to the completed transaction file 
in response to determining that the second transaction from the uncompleted 
transaction file corresponds to the completion record for the second transaction 
from the log file; 

calculating a transaction elapsed time for the third transaction by 
subtracting a time stamp of the first record of the third transaction from the 
current time; 

comparing the transaction elapsed time of the third transaction to the 
transaction time limit; and 

writing the third transaction to an aborted transaction file when the 
transaction elapsed time of the third transaction has exceeded the transaction time 
limit. 

53. The carrier medium of claim 51, wherein the carrier medium is a memory 
medium. 

ABSTRACT OF THE DISCLOSURE 



An improved method and system for logging transaction records in a computer 
system. The method may include writing a confirmation log record to the log file for a 
transaction that completes normally, and not writing a confirmation log record for 
transactions that are aborted. The log file may be unloaded periodically by an unload 
program. The unload program may write transaction log records accompanied by a 
confirmation log record to a good output file and transaction log records not accompanied 
by a confirmation log record to a suspended output file. On a subsequent execution, the 
unload program may combine the log records in the log file and the suspended file. The 
unload program may write transaction log records accompanied by a confirmation log 
record to a good output file. The unload program may write transaction log records not 
accompanied by a confirmation log record and which have not exceeded a transaction 
time limit to a suspended output file. The unload program may write transaction log 
records not accompanied by a confirmation log record and which have exceeded a 
transaction time limit to a disposal output file. The transaction log records in the good 
output file may then be processed normally by log processing programs. 
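
The unload pass summarized above could be sketched along the following lines. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `unload`, the dictionary keys `txn`, `start`, and `confirm`, and the use of in-memory lists in place of the good, suspended, and disposal output files are all hypothetical. The sketch also assumes that confirmation log records themselves are not copied to any output.

```python
def unload(log_records, suspended_records, now, time_limit):
    """Split records into (good, suspended, disposal) lists."""
    # Combine the records carried over from the previous run with the new log.
    records = list(suspended_records) + list(log_records)
    # Transactions for which a confirmation log record is present.
    confirmed = {r["txn"] for r in records if r.get("confirm")}

    good, suspended, disposal = [], [], []
    for r in records:
        if r.get("confirm"):
            continue  # assumption: confirmation records are not written out
        if r["txn"] in confirmed:
            good.append(r)
        elif now - r["start"] > time_limit:
            disposal.append(r)  # unconfirmed and past the time limit
        else:
            suspended.append(r)  # unconfirmed but still within the limit
    return good, suspended, disposal


# First run: A is confirmed; B and C are still pending.
log1 = [
    {"txn": "A", "start": 0.0},
    {"txn": "B", "start": 1.0},
    {"txn": "C", "start": 2.0},
    {"txn": "A", "confirm": True},
]
good1, susp1, disp1 = unload(log1, [], now=5.0, time_limit=60.0)

# Second run: B's confirmation has arrived; C has exceeded the time limit.
log2 = [{"txn": "B", "confirm": True}]
good2, susp2, disp2 = unload(log2, susp1, now=100.0, time_limit=60.0)
```

In this example the first run places A in the good output and suspends B and C. The second run promotes B to the good output and moves C to the disposal output, since 100.0 - 2.0 exceeds the limit of 60.0.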